Comparison of Performance of ARIMA and LSTM Models for Stock Price Prediction


1 Introduction 


The prospect of making significant returns makes stock markets attractive to traders, investors, and professionals alike. However, investing in the stock market is risky owing to many factors, ranging from macroeconomic and microeconomic conditions to government policies, individual companies' financial performance, conflicts, and natural disasters.


Therefore, predicting stock prices is of significant interest to market participants, as it can assist in making predictable returns, deciding investment strategies, asset allocation, and portfolio management.


Over the years, experts have developed numerous mathematical models to identify underlying patterns in market data and forecast stock prices. One of these is the Auto-Regressive Integrated Moving Average (ARIMA), a popular statistical model for forecasting a time series that can be made stationary.¹


1 Fuqua School of Business. Introduction to ARIMA Models. https://people.duke.edu/~rnau/411arim.htm. Accessed 10 Dec 2023.
Recent advances in artificial intelligence have brought to prominence Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM), which employ deep learning to identify complex underlying data patterns and make predictions.²


However, despite their popularity, limited research is available that establishes the superiority of either ARIMA or LSTM in making accurate predictions of stock prices. The findings of Kobiela et al. suggest that ARIMA is more accurate than LSTM, while the findings of others, like Ma and Siami-Namini et al., suggest the opposite.³ ⁴


This extended essay aims to compare the accuracy of both models in predicting stock prices to determine which model offers superior performance based on empirical evidence, by answering the research question:


“Which model, ARIMA or LSTM, demonstrates superior accuracy in predicting stock prices based on empirical evidence?”


2 Methodology 


The study involved the following steps: 


2 Kobiela, Dariusz, et al. “ARIMA Vs LSTM on NASDAQ Stock Exchange Data.” Procedia Computer Science, vol. 207, Jan. 2022, pp. 3836-45. https://doi.org/10.1016/j.procs.2022.09.445. Accessed 10 Dec 2023.


3 Kobiela, Dariusz, et al. “ARIMA Vs LSTM on NASDAQ Stock Exchange Data.” Procedia Computer Science, vol. 207, Jan. 2022, pp. 3836-45. https://doi.org/10.1016/j.procs.2022.09.445. Accessed 10 Dec 2023.


4 S. Siami-Namini, N. Tavakoli and A. Siami Namin, "A Comparison of ARIMA and LSTM in Forecasting Time Series," 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 2018, pp. 1394-1401, doi: 10.1109/ICMLA.2018.00227. PDF available at https://sci-hub.se/10.1109/ICMLA.2018.00227. Accessed 10 Dec 2023.
• Obtaining historical data for selected stocks and preprocessing it for use with ARIMA/LSTM models.

• Splitting the pre-processed data into training/testing sets.

• Estimating optimal ARIMA and LSTM models and fitting them to the training data.

• Making predictions for the test set.

• Evaluating the performance of both models using statistical metrics such as Root Mean Square Error (RMSE).

• Analyzing both models using the performance metrics to draw conclusions.


A more thorough discussion of the experimental methodology followed is given in Section 4.


3 Theoretical Background 


The theoretical background of LSTM and ARIMA relevant to answering the research question is discussed here.


3.1 Time Series 


A time series is an ordered sequence of data points indexed by time, such as a stock's daily closing prices, a product's monthly sales, or the annual rainfall in a region.⁵


5 Hayes, Adam. “What Is a Time Series and How Is It Used to Analyze Data?” Investopedia, 13 June 2022, 
www.investopedia.com/terms/t/timeseries.asp. 
3.2 Time Series Analysis and Forecasting


Time Series Analysis involves examining time series data to gain meaningful insights from it.⁶


Time Series Forecasting is the process of predicting future values of a time series based on historical data.⁷


3.3 Stationarity 


Time series data is often non-stationary, meaning that statistical properties like the mean and variance change over time due to inherent trends, seasonality, cyclical fluctuations, and random noise. Modelling and predicting non-stationary time series data is challenging and requires specialized techniques.⁸


3.4 Auto Regressive Integrated Moving Average (ARIMA) Model 


ARIMA is a statistical model for forecasting time series data which can be made stationary by employing techniques such as differencing or nonlinear transformations. A stationary time series has no trend, variations of constant amplitude around its mean, and consistent short-term random time patterns.⁹


6 Gupta, Sakshi. “What Is Time Series Forecasting? Overview, Models & Methods.” Springboard Blog, 28 Sept. 2023, https://www.springboard.com/blog/data-science/time-series-forecasting/. Accessed 10 Dec 2023.


7 Gupta, Sakshi. “What Is Time Series Forecasting? Overview, Models & Methods.” Springboard Blog, 28 Sept. 2023, https://www.springboard.com/blog/data-science/time-series-forecasting/. Accessed 10 Dec 2023.


8 Gupta, Sakshi. “What Is Time Series Forecasting? Overview, Models & Methods.” Springboard Blog, 28 Sept. 2023, https://www.springboard.com/blog/data-science/time-series-forecasting/. Accessed 10 Dec 2023.


9 Fuqua School of Business. Introduction to ARIMA Models. https://people.duke.edu/~rnau/411arim.htm. Accessed 10 Dec 2023.
3.4.1 ARIMA Forecasting Equation

A forecasted value of a stationary time series can be expressed as a weighted sum of previous observations (referred to as the lagged observations or the Autoregressive terms) and/or a weighted sum of previous forecast errors (referred to as the lagged errors or the Moving Average terms), plus a constant.¹⁰


The ARIMA forecasting equation can be expressed as:¹¹


$$\hat{y}_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q}$$

Where,


• $\hat{y}_t$ is the predicted value at time $t$.

• $\mu$ is the constant term, which is related to the mean of the series.

• $y_{t-1}, \dots, y_{t-p}$ are the past values of the series at times $t-1, \dots, t-p$, also called lags of the series or the AR terms.

• $e_{t-1}, \dots, e_{t-q}$ are the past forecast errors at times $t-1, \dots, t-q$, also called the MA terms.

• $\phi_1, \dots, \phi_p$ are the parameters of the AR terms.

• $\theta_1, \dots, \theta_q$ are the parameters of the MA terms.

• $p$ is the number of AR terms, also called the AR order.


10 Fuqua School of Business. Introduction to ARIMA Models. https://people.duke.edu/~rnau/411arim.htm. Accessed 10 Dec 2023.


11 Fuqua School of Business. Introduction to ARIMA Models. https://people.duke.edu/~rnau/411arim.htm. Accessed 10 Dec 2023.
• $q$ is the number of MA terms, also called the MA order.
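
As a minimal sketch of how the forecasting equation combines its terms, the snippet below evaluates it for one time step. The function name and all coefficient and data values are hypothetical, chosen only for illustration; they are not taken from the models fitted in this study.

```python
def arma_one_step(mu, phi, theta, past_values, past_errors):
    """One-step forecast: mu + sum(phi_i * y_{t-i}) - sum(theta_j * e_{t-j}).

    past_values[0] is y_{t-1}, past_values[1] is y_{t-2}, and so on;
    past_errors uses the same ordering for e_{t-1}, e_{t-2}, ...
    """
    ar_part = sum(p * y for p, y in zip(phi, past_values))
    ma_part = sum(t * e for t, e in zip(theta, past_errors))
    return mu + ar_part - ma_part


# Hypothetical model with one AR term and one MA term (p = 1, q = 1).
forecast = arma_one_step(mu=0.5, phi=[0.6], theta=[0.3],
                         past_values=[2.0], past_errors=[0.4])
print(round(forecast, 2))  # 0.5 + 0.6 * 2.0 - 0.3 * 0.4 = 1.58
```

Note that the MA terms enter with a minus sign, matching the sign convention of the equation above.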


3.4.2 ARIMA Order 

ARIMA, denoted as ARIMA(p,d,q), is thus a combination of the Autoregressive model (AR terms) and the Moving Average model (MA terms). A time series that needs to be differenced to be made stationary is said to be an Integrated version of a stationary time series. Differencing is done by subtracting the previous observation from the current observation. The parameter d refers to the number of times the series needs to be differenced to become stationary.¹²
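
Differencing as described above can be sketched in a few lines of plain Python. The helper name and the price values are illustrative assumptions, not data from this study.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_{t-1}."""
    result = list(series)
    for _ in range(d):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


prices = [100.0, 102.0, 105.0, 109.0, 114.0]  # hypothetical prices
once = difference(prices, d=1)
twice = difference(prices, d=2)
print(once)   # [2.0, 3.0, 4.0, 5.0]
print(twice)  # [1.0, 1.0, 1.0]
```

Each pass shortens the series by one observation, and in this example the trend in the first differences disappears after a second pass.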


The parameters p, d and q, collectively referred to as the order of the ARIMA model, must be tuned before forecasting for optimal results.


3.4.3 Forecasting with ARIMA Models 


Forecasting with ARIMA using the Box-Jenkins Methodology involves the following steps:¹³


3.4.3.1 Identification


The identification step entails estimating the order (the values of p, d, and q). This is achieved by:


Differencing the data iteratively until stationarity is achieved, which can be confirmed by statistical tests such as the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The number of times differencing is applied to make the data stationary gives an estimate of the order of differencing, the d parameter in ARIMA(p,d,q).¹⁴


12 Fuqua School of Business. Introduction to ARIMA Models. https://people.duke.edu/~rnau/411arim.htm. Accessed 10 Dec 2023.


13 “Box-Jenkins Methodology.” Columbia University Mailman School of Public Health, 3 Oct. 2022, www.publichealth.columbia.edu/research/population-health-methods/box-jenkins-methodology.


Visualizing data through decomposition, autocorrelation (ACF), and partial autocorrelation (PACF) plots. The order of autoregression (p) and order of moving average (q) can be determined by observing the lags in these plots.¹⁵
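
The values an ACF plot displays can be computed directly from the data. The sketch below, in plain Python, uses the usual sample autocorrelation formula; the function name and both example series are illustrative assumptions. In practice a plotting helper such as statsmodels' plot_acf would normally be used instead.

```python
def acf(series, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(x * x for x in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


trending = [float(t) for t in range(20)]   # steadily rising series
alternating = [1.0, -1.0] * 10             # flips sign at every step

r_trend = acf(trending, max_lag=3)
r_alt = acf(alternating, max_lag=3)
print(round(r_trend[1], 2))  # 0.85: strong positive lag-1 correlation
print(round(r_alt[1], 2))    # -0.95: strong negative lag-1 correlation
```

A slowly decaying, strongly positive ACF like that of the trending series is a common visual hint that a series still needs differencing.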


3.4.3.2 Estimation 

The estimation step involves fitting candidate ARIMA models to the training data using the values of p, d, and q estimated in the previous step. Each candidate is evaluated using metrics such as the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), and the model that minimizes both AIC and BIC values is chosen for forecasting.¹⁶
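
For reference, the two criteria follow the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below applies them to two candidates; the log-likelihood values and parameter counts are hypothetical and only illustrate the comparison.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical candidates: (label, log-likelihood, number of parameters).
candidates = [("ARIMA(1,1,1)", -520.0, 3), ("ARIMA(2,1,2)", -519.0, 5)]
n_obs = 1000

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, ll, k in candidates}
best = min(scores, key=lambda name: scores[name][0])
print(best)  # ARIMA(1,1,1): the small likelihood gain does not justify two extra parameters
```

Both criteria penalize extra parameters, with BIC penalizing them more heavily once n exceeds about 7 observations.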


3.4.3.3 Validation


The performance of the chosen ARIMA model is then evaluated on the testing data. This involves:


• Forecasting.


14 Stationarity and Detrending (ADF/KPSS) - Statsmodels 0.15.0 (+200). www.statsmodels.org/dev/examples/notebooks/generated/stationarity_detrending_adf_kpss.html.


15 Iamleonie. “Time Series: Interpreting ACF and PACF.” Kaggle, 15 Mar. 2022, http://www.kaggle.com/code/iamleonie/time-series-interpreting-acf-and-pacf. Accessed 12 Dec 2023.

16 Brownlee, Jason. “Probabilistic Model Selection With AIC, BIC, and MDL.” MachineLearningMastery.com, 27 Aug. 2020, https://machinelearningmastery.com/probabilistic-model-selection-measures. Accessed 20 Dec 2023.
• Evaluating the model's accuracy using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).¹⁷
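
The three accuracy metrics can be written out as follows. This is a plain-Python sketch with hypothetical actual and predicted values; MAPE is expressed in percent here, which is an assumption about the convention.

```python
import math


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 102.0, 104.0]      # hypothetical observed prices
predicted = [101.0, 100.0, 104.0]   # hypothetical forecasts

print(round(rmse(actual, predicted), 4))  # 1.291
print(mae(actual, predicted))             # 1.0
print(round(mape(actual, predicted), 4))  # 0.9869
```

RMSE penalizes large errors more strongly than MAE, so RMSE is never smaller than MAE on the same forecasts.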


3.4.4 Employing Search Methods 

Alternatively, the order of the ARIMA model can be estimated by employing search algorithms such as grid search or random search, which seek the combination of parameters that minimizes AIC and BIC values or offers the best performance on the training data according to metrics such as RMSE, MAE, and MAPE.
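
A grid search over candidate orders amounts to scoring every (p, d, q) combination and keeping the best one. The sketch below shows only that search loop: the scoring function is a stand-in invented for illustration, whereas a real one would fit ARIMA(p, d, q) to the training data and return its AIC or error metric.

```python
from itertools import product


def grid_search_order(score, p_values, d_values, q_values):
    """Return the (p, d, q) combination with the lowest score."""
    return min(product(p_values, d_values, q_values), key=score)


def toy_score(order):
    """Stand-in for a fitted model's AIC (hypothetical, not a real fit)."""
    p, d, q = order
    return (p - 1) ** 2 + (d - 1) ** 2 + (q - 1) ** 2 + 0.1 * (p + q)


best_order = grid_search_order(toy_score, range(3), range(2), range(3))
print(best_order)  # (1, 1, 1)
```

The cost grows with the product of the three ranges, which is why the candidate ranges are usually kept small.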


3.4.5 Software Libraries 
The statsmodels library in Python provides tools for plotting ACF, PACF, and decomposition plots, performing statistical tests (ADF and KPSS), computing information criteria (AIC and BIC), fitting the ARIMA model, and performing the forecasting and validation steps.¹⁸
The scikit-learn library offers methods for grid search, random search, etc.¹⁹


However, libraries like pmdarima provide specialized functions for finding the optimal ARIMA model, eliminating the need for manual pre-processing and custom implementation, saving time, and reducing the scope for errors.²⁰


17 Sumi. “Understand ARIMA and Tune P, D, Q.” Kaggle, 20 Aug. 2018, www.kaggle.com/code/sumi25/understand-arima-and-tune-p-d-q. Accessed 22 Dec 2023.


18 Time Series Analysis Tsa - Statsmodels 0.14.1. www.statsmodels.org/stable/tsa.html#descriptive-statistics-and-tests. Accessed 27 Dec 2023.

19 Rendyk. “Tuning the Hyperparameters and Layers of Neural Network Deep Learning.” Analytics Vidhya, 12 Jan. 2024, www.analyticsvidhya.com/blog/2021/05/tuning-the-hyperparameters-and-layers-of-neural-network-deep-learning. Accessed 21 Dec 2023.


20 pmdarima.arima.auto_arima - Pmdarima 2.0.4 Documentation. https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html. Accessed 20 Dec 2023.
3.5 Long Short-Term Memory (LSTM)


LSTM, on the other hand, is a type of Recurrent Neural Network (RNN)²¹ that addresses the limitations of conventional RNNs, such as vanishing and exploding gradients, enabling it to learn long-term dependencies in sequential data, which conventional RNNs fail to capture.²²


3.5.1 RNN

RNNs are deep learning models that can be trained to process sequential data and give an output. Unlike traditional neural networks, where dataflow is unidirectional, RNNs have a feedback mechanism that allows the data to flow in both directions, allowing them to retain past data for future use. Their ability to ‘memorize’ makes RNNs suitable for applications needing the identification of dependencies and patterns in sequential data, such as time series forecasting, speech recognition, and natural language processing.²³


RNNs are made of neurons organised into input, hidden, and output layers. The input layer receives the incoming data and passes it to the hidden layers, one step at a time. The hidden layer(s) process this incoming data, combining it with ‘memorized’ data to generate an output that is then passed to the output layer. The feedback loop in the hidden layer(s) allows them to retain previous inputs and combine them with each new input to generate an output modulated by past observations, which is again fed back. This recurrent feedback mechanism gives RNNs the ability to learn from past data.²⁴


21 Barla, Nilesh. “Recurrent Neural Network Guide: A Deep Dive in RNN.” neptune.ai, 22 Aug. 2023, https://neptune.ai/blog/recurrent-neural-network-guide. Accessed 20 Dec 2023.


22 Understanding LSTM Networks -- Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. Accessed 20 Dec 2023.


23 Kalita, Debasish. “A Brief Overview of Recurrent Neural Networks (RNN).” Analytics Vidhya, 7 Nov. 2023, www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn. Accessed 22 Dec 2023.


However, when the incoming data changes too quickly or too slowly, an RNN may struggle to adjust its parameters appropriately, leading to the exploding and vanishing gradient problems, resulting in overfitting or underfitting of the model.²⁵


3.5.2 LSTM Neural Networks 

LSTM, a type of RNN with a modified architecture, addresses the above limitations of conventional RNNs by incorporating additional memory cells and gates that allow it to retain or discard information selectively, making it suitable for applications such as time series forecasting, where long-term dependencies are prevalent.²⁶


24 Kalita, Debasish. “A Brief Overview of Recurrent Neural Networks (RNN)." Analytics Vidhya, 7 Nov. 2023, 
www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn. Accessed 22 Dec 2023. 


25 Kalita, Debasish. “A Brief Overview of Recurrent Neural Networks (RNN)." Analytics Vidhya, 7 Nov. 2023, 
www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn. Accessed 22 Dec 2023. 


26 Understanding LSTM Networks -- Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. Accessed 20 Dec 2023.
3.5.2.1 Architecture


Figure 1: Architecture of LSTM (Source: Understanding LSTM Networks -- Colah's Blog)


[Legend symbols: Neural Network Layer, Pointwise Operation, Vector Transfer, Concatenate, Copy]


Figure 2: Symbol Notation in Fig 1 (Source: Understanding LSTM Networks -- Colah’s Blog)


In the architectural diagram of LSTM given in Fig 1, each line carries an entire data vector from the output of one node to the inputs of other nodes. The pink circles represent pointwise operations, like vector addition and multiplication, while the yellow boxes are learned neural network layers. Merging lines denote the concatenation of data, while forking lines denote the copying and distribution of data to different locations.²⁷


27 Understanding LSTM Networks -- Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. Accessed 20 Dec 2023.
LSTMs have a chain-like structure with four layers, comprising three gates and an update layer, which operate on the input data sequentially, as depicted in Figure 3. The sequence of operations is described in the following paragraphs.


[Diagram labels: Forget Gate, Input Gate, Output Gate]


Figure 3: Schematic Diagram of LSTM Representing Gates. The horizontal line on top represents the cell state (Source: Understanding LSTM Networks -- Colah's Blog)
3.5.2.2 Operation
Forget Gate: The forget gate layer (Figure 4) processes the combination of the previous hidden state ($h_{t-1}$) and the current input ($x_t$). It decides which information from the cell state should be discarded and which should be passed on. It is a sigmoid layer.²⁸


$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$


Figure 4: Schematic Representation of LSTM Forget Gate (Source: Understanding LSTM Networks -- Colah’s Blog)

Input Gate: The input gate layer (Figure 5) processes the combination of the previous hidden state ($h_{t-1}$) and the current input ($x_t$). It decides which new information should be stored in the cell state. It comprises a sigmoid layer (to determine which values $i_t$ to update) and a tanh layer (to create a vector of new candidate values $\tilde{C}_t$).²⁹


28 Understanding LSTM Networks — Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. 
Accessed 20 dec 2023. 


29 Understanding LSTM Networks — Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. 
Accessed 20 dec 2023. 
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$


Figure 5: Schematic Representation of LSTM Input Gate (Source: Understanding LSTM Networks -- Colah's Blog)


Cell State Update: The cell state (Figure 6) is updated by combining the information from the forget gate ($f_t$) and the information from the input gate ($i_t$ and $\tilde{C}_t$). The forget gate decides what to remove from the previous cell state ($C_{t-1}$), and the input gate decides what to add. This updated cell state becomes the memory of the LSTM.³⁰


$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$


Figure 6: Schematic Representation of LSTM Cell State Update Step (Source: Understanding LSTM Networks -- Colah's Blog)


30 Understanding LSTM Networks — Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. 
Accessed 20 dec 2023. 
Output Gate: The output gate layer (Figure 7) processes the combination of the previous hidden state ($h_{t-1}$) and the current input ($x_t$), like the forget and input gates. It determines the next hidden state ($h_t$) based on the updated cell state. The output gate includes a sigmoid layer (to determine which values of the cell state to output) and a tanh layer (to transform the values to between -1 and 1).³¹


$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$


Figure 7: Schematic Representation of Output Gate (Source: Understanding LSTM Networks -- Colah's Blog)
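
The four gate operations described in this sub-section can be put together into a single time step. The NumPy sketch below is illustrative only: the layer sizes, the random weights, and the dictionary layout of W and b are assumptions made for the example, and no training is performed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W[k] has shape (hidden, hidden + input) for k in 'f', 'i', 'c', 'o'."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ concat + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])       # input gate
    c_tilde = np.tanh(W["c"] @ concat + b["c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde            # cell state update
    o_t = sigmoid(W["o"] @ concat + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                      # next hidden state
    return h_t, c_t


hidden, n_in = 3, 1                               # hypothetical sizes
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(hidden, hidden + n_in)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}

h, c = np.zeros(hidden), np.zeros(hidden)
for x in [0.2, 0.4, 0.6]:                         # a short, made-up input sequence
    h, c = lstm_step(np.array([x]), h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Because the hidden state is a sigmoid output multiplied by a tanh output, every component of h stays strictly between -1 and 1.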


3.5.3 Forecasting with LSTM Models 


3.5.3.1 Parameters and Hyperparameters of LSTM 

In LSTM, parameters and hyperparameters are two different but related concepts. The model's hyperparameters are top-level settings that control the learning process and determine the model parameters. They are chosen by the model designer before training begins and remain unchanged throughout the learning process. The hyperparameters of the model include the number of hidden layers, number of neurons in each hidden layer, number of epochs, batch size, dropout rate, optimizer, loss function, stateful, shuffle, and reset states, each having an impact on the performance of the model.³²


31 Understanding LSTM Networks -- Colah's Blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs. Accessed 20 Dec 2023.


On the other hand, the model's parameters are internal to the model and are learned by the model during training based on the data and the hyperparameters. The parameters of the model include its weights and biases, which are updated during training, unlike the hyperparameters, which remain unchanged.³³
An optimal choice of hyperparameters is crucial for the model to perform well.


3.5.3.2 Tuning the Hyperparameters of LSTM Models 

The designer usually determines the hyperparameters of LSTM models manually, based on domain expertise and experience. An alternative approach is to employ search algorithms such as grid search, random search, and Bayesian optimization to find the optimal combination of hyperparameters that minimizes the loss function.³⁴


The KerasTuner library in Python provides a flexible and efficient way to perform hyperparameter tuning using grid search and random search, eliminating the need for custom implementation, saving time, and reducing the scope for errors.³⁵


32 Nyuytiymbiy, Kizito. “Parameters, Hyperparameters, Machine Learning | Towards Data Science." Medium, 7 
Mar. 2023, https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac. Accessed 20 Dec 2023. 


33 Nyuytiymbiy, Kizito. “Parameters, Hyperparameters, Machine Learning | Towards Data Science." Medium, 7 
Mar. 2023, https://towardsdatascience.com/parameters-and-hyperparameters-aa609601a9ac. Accessed 20 Dec 2023. 


34 Rendyk. “Tuning the Hyperparameters and Layers of Neural Network Deep Learning.” Analytics Vidhya, 12 Jan. 2024, www.analyticsvidhya.com/blog/2021/05/tuning-the-hyperparameters-and-layers-of-neural-network-deep-learning. Accessed 21 Dec 2023.


35 Team, Keras. Keras Documentation: KerasTuner API. https://keras.io/api/keras_tuner. Accessed 12 Dec 2023. 
4 Experimental Methodology


The experimental setup for the study was implemented in Python 3.11 in a Jupyter Notebook environment. The steps followed to answer the research question are discussed below.


4.1 Step 1: Data Collection 


Historical 5-year stock price data from 01 January 2018 to 01 January 2023 was obtained from Yahoo Finance, using the yfinance library, for the following ten companies in the S&P 500 index. The companies represent a diverse set of sectors, to avoid sector bias, account for macroeconomic factors, and make the results more generalizable.³⁶ ³⁷


Ticker Symbol | Company Name | Sector
GOOG | Alphabet Inc. | Technology
JPM | JPMorgan Chase & Co. | Financial Services
JNJ | Johnson & Johnson | Healthcare
WMT | Walmart Inc. | Consumer Defensive
TSLA | Tesla Inc. | Automobiles
AMZN | Amazon.com Inc. | E-Commerce
BP | BP plc | Oil & Gas
NKE | Nike Inc. | Apparel
KO | The Coca-Cola Company | Beverages
PFE | Pfizer Inc. | Pharmaceuticals


(Table 1: List of stocks chosen for the study)


36 “Yahoo Finance - Stock Market Live, Quotes, Business and Finance News.” Yahoo Finance - Stock Market Live, Quotes, Business & Finance News, finance.yahoo.com.


37 “Yfinance.” PyPI, 21 Jan. 2024, pypi.org/project/yfinance.


A period of five years was chosen as it provided sufficient data for analysis while keeping the computational cost of the models manageable.


4.2 Step 2: Data Visualization and Preprocessing 


4.2.1 Data Cleaning 

The data obtained from Yahoo Finance was indexed by datetime, sorted in ascending order, and cleaned by filling in missing values with the previous day's closing price. The Adj Close prices of the stocks were selected for further analysis, as they are adjusted for corporate actions such as stock splits and dividends, which would otherwise distort the price series.
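
The cleaning step can be sketched as follows in plain Python. The dates and prices are made up for illustration, and the helper name is an assumption; with a pandas DataFrame the same effect is usually obtained with its sorting and forward-fill methods.

```python
from datetime import date


def forward_fill(values):
    """Replace each missing value (None) with the most recent known value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical, unsorted rows of (date, adjusted close) with one gap.
rows = [(date(2022, 1, 5), None), (date(2022, 1, 3), 10.0), (date(2022, 1, 4), 10.4)]
rows.sort(key=lambda r: r[0])                     # ascending by date
dates = [d for d, _ in rows]
prices = forward_fill([p for _, p in rows])
print(prices)  # [10.0, 10.4, 10.4]
```

A missing value at the very start of the series has no earlier price to copy and is left as None by this sketch.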


4.2.2 Data Visualization 
The Adj Close prices, decomposed components (trend, seasonality, and residual), ACF, and PACF plots for each stock were plotted to visualize the data and identify any patterns.


The plots revealed the presence of trends and seasonality in the data, indicating that the data is not stationary. The plots for GOOG are provided below:


[Figure: line plot of the Adjusted Close Price of GOOG; x-axis: Date, y-axis: Adjusted Closing Price]


Figure 8: Adjusted Close Price of Alphabet Inc. (GOOG) 


[Figure: Rolling Mean and Standard Deviation for GOOG]

Figure 9: Plot of Rolling Mean and Standard Deviation of Adjusted Close Price of Alphabet Inc. (GOOG)
[Figure: Seasonal Decomposition of GOOG (Multiplicative Model). Annotation in figure: both trend and seasonal components should appear stable and not exhibit any pattern or trend; the residual component should be random and not exhibit any pattern or trend.]

Figure 10: Seasonal Decomposition of GOOG (Multiplicative Model)
[Figure: Time Series Analysis Plots of ACF and PACF of GOOG Adjusted Close Price]

Figure 11: ACF and PACF plots of Adjusted Close Prices of GOOG


Statistical tests such as the Augmented Dickey-Fuller (ADF) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests, run using the statsmodels library, also confirmed that the data is not stationary.


A summary of the ADF and KPSS test results for GOOG is given in Fig 12 below:


Performing Augmented Dickey-Fuller Test on GOOG
Results of Augmented Dickey-Fuller Test: GOOG
Test Statistic                   -1.210835
p-value                           0.668918
#Lags Used                        1.000000
Number of Observations Used    1257.000000
Critical Value (1%)              -3.435563
Critical Value (5%)              -2.863842
Critical Value (10%)             -2.567996
dtype: float64
p-value: 0.6689176927500179
ADF test indicates that GOOG is not stationary as p-value is greater than 0.05
ADF test indicates that GOOG is not stationary as Test Statistic is greater than Critical Value in 5%
End of Augmented Dickey-Fuller Test

Performing KPSS Test on GOOG
Results of KPSS Test:GOOG
Test Statistic            4.654283
p-value                   0.010000
#Lags Used               21.000000
Critical Value (10%)      0.347000
Critical Value (5%)       0.463000
Critical Value (2.5%)     0.574000
Critical Value (1%)       0.739000
dtype: float64
KPSS test indicates that GOOG is not stationary as Test Statistic is greater than Critical Value in 5%
End of KPSS Test


Figure 12: Output of Stationarity Test using ADF and KPSS Tests 


4.2.3 Pre-Processing of Data 


4.2.3.1 ARIMA 
The data was made stationary by differencing the time series until the effects of trends and seasonality were removed, i.e., until both the ADF and KPSS tests indicated stationarity.


4.2.3.2 LSTM 
In the case of LSTM, the data was scaled to the range [0,1] and then split into the training and testing sets. The MinMaxScaler class in the scikit-learn (sklearn) library was used to scale the data.
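
The scaling applied here maps each value v to (v - min) / (max - min). The sketch below reproduces that arithmetic and its inverse in plain Python with hypothetical prices; it illustrates the transformation and is not the study's code, which used scikit-learn.

```python
def fit_min_max(values):
    """Return the minimum and maximum used for scaling."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values to [0, 1] using (v - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]


def inverse_scale(scaled, lo, hi):
    """Map scaled values back to the original price range."""
    return [s * (hi - lo) + lo for s in scaled]


prices = [50.0, 75.0, 100.0]          # hypothetical prices
lo, hi = fit_min_max(prices)
scaled = scale(prices, lo, hi)
restored = inverse_scale(scaled, lo, hi)
print(scaled)    # [0.0, 0.5, 1.0]
print(restored)  # [50.0, 75.0, 100.0]
```

The inverse step matters for evaluation: predictions made on the scaled data have to be mapped back to prices before error metrics in price units can be computed.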


4.2.4 Splitting the Data into Training and Testing Sets 
The differenced data (for ARIMA) and the scaled data (for LSTM) were split into training and testing sets using a split ratio of 95:5.
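
For time series, the split is chronological rather than random, so that the test set always lies after the training set. A minimal sketch, assuming a hypothetical series of 1,259 daily observations (roughly five years of trading days):

```python
def chronological_split(series, train_fraction=0.95):
    """Split a time-ordered series into leading training and trailing test parts."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


series = list(range(1259))            # placeholder values standing in for prices
train, test = chronological_split(series)
print(len(train), len(test))  # 1196 63
```

With a 95:5 ratio, the test set covers only about three months of trading days, which is worth keeping in mind when interpreting the results.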


4.3 Step 3: Model Implementation 


The ARIMA and LSTM models were fitted to the training data, and the testing data was used to evaluate the performance of each model.
4.3.1 Finding Optimal Parameters for ARIMA
The number of times the data was differenced during the pre-processing stage to achieve stationarity determined the order of differencing (d).


The PACF and ACF plots of the differenced time series were analyzed to determine the autoregression (p) and moving average (q) orders, respectively.


The estimated values of p, d and q for Alphabet Inc. (GOOG) are p = 1, d = 1 and q = 1, indicating that the ARIMA model for GOOG is given by ARIMA(1,1,1).


4.3.2 Estimation of ARIMA Model Parameters Using pmdarima Library 
The manual approach discussed above is a non-scalable and error-prone process due to its reliance on the visualization and analysis of plots. Therefore, this method was applied solely to one stock, Alphabet Inc., to understand the process involved.


However, to obtain the experimental results for this study, the auto_arima function of the pmdarima library was preferred, as it allowed for the determination of the optimal ARIMA model using a programmatic approach and offered the benefits discussed in sub-section 3.4.5 of the Theoretical Background.


The code used for experimentation is documented in Appendix A. 


4.3.3 Hyperparameters Tuning for LSTM 

Tuning hyperparameters in LSTM is a complex task requiring domain expertise, time, and computational resources. The KerasTuner library, designed for hyperparameter optimization and offering several benefits, as discussed in sub-section 3.5.3.2 of the Theoretical Background section, was employed to tune the LSTM model.
Stateful vs. Stateless LSTMs. A stateful LSTM was chosen for its ability to capture the trends and seasonality in the data, if any, and improve the accuracy of the predictions. The Keras library was explicitly configured by setting the stateful parameter to True, the batch_size parameter to 1 to preserve the state within the same epoch, the shuffle parameter to False to preserve the order of the data, and the reset_states parameter to True to reset the state after each epoch.


4.4 Step 4: Prediction 


The time horizon for predictions (for example, a daily, weekly, monthly, or yearly forecast) depends on the application's needs. However, forecasting far into the future reduces prediction accuracy due to the absence of recent data and the compounding of errors.


A model’s forecast accuracy can be improved by using a rolling forecast, in which the latest observed data is fed back to make the next prediction.³⁸


This study employed a one-step rolling forecast to predict the stock price one day at a time, continuously updating the model with the latest available data.
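
The one-step rolling forecast can be sketched as a simple loop: forecast the next value from all data seen so far, then add the true observation to the history before forecasting again. The model below is a deliberately naive stand-in (it repeats the last observed value) used only to show the loop; it is not the ARIMA or LSTM model of this study, and the data is hypothetical.

```python
def rolling_one_step_forecast(train, test, predict_next):
    """Walk forward through the test set, one prediction at a time."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(predict_next(history))  # forecast from data so far
        history.append(observed)                   # then reveal the true value
    return predictions


def naive_last_value(history):
    """Stand-in model: tomorrow's price equals today's price."""
    return history[-1]


train = [10.0, 11.0, 12.0]    # hypothetical training prices
test = [13.0, 12.5, 14.0]     # hypothetical test prices
preds = rolling_one_step_forecast(train, test, naive_last_value)
print(preds)  # [12.0, 13.0, 12.5]
```

Each prediction uses only observations that would have been available on the day it was made, so no future information leaks into the forecast.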


4.5 Step 5: Evaluation of Models 


The ARIMA and LSTM models were fitted to the training data, and the testing data was used to evaluate each model's performance. Predictions were made using a one-step rolling forecast approach. The performance of the models was evaluated by comparing the predicted values with the actual values, using the statistical metrics RMSE, MAE, and MAPE.


38 Brownlee, Jason. “Time Series Forecasting With the Long Short-Term Memory Network in Python.” MachineLearningMastery.com, 27 Aug. 2020, machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python. Accessed 25 Dec 2023.


5 Results and Analysis 


5.1 Performance Metrics on a Per-Stock Basis


The performance metrics obtained on a per-stock basis are tabulated below (output of Step 5 of Appendix A):

Symbol | RMSE (ARIMA) | RMSE (LSTM) | MAPE (ARIMA) | MAPE (LSTM) | MAE (ARIMA) | MAE (LSTM)
GOOG | 2.5754 | 4.5959 | 2.0113 | 3.9893 | 1.9010 | 3.7554
JPM | 2.1378 | 4.3924 | 1.3682 | 2.9673 | 1.6205 | 3.5093
JNJ | 1.5489 | 2.6498 | 0.7146 | 1.3014 | 1.1852 | 2.1586
WMT | 1.9617 | 3.2097 | 1.0048 | 1.8503 | 1.3913 | 2.5982
TSLA | 8.2851 | 18.7155 | 3.4410 | 8.6554 | 6.2608 | 15.3805
AMZN | 3.0969 | 5.7947 | 2.3246 | 4.7190 | 2.3032 | 4.6382
BP | 0.5660 | 1.0318 | 1.4066 | 2.7565 | 0.4433 | 0.8559
NKE | 2.7681 | 5.1574 | 1.9981 | 3.8467 | 1.9857 | 3.8372
KO | 0.6674 | 1.2222 | 0.8930 | 1.5761 | 0.5143 | 0.9077
PFE | 0.7239 | 1.2188 | 1.2161 | 2.2088 | 0.5469 | 0.9912


(Table 2: Comparison of ARIMA and LSTM performance on a per stock basis)
5.1.1 Interpretation of Performance Metrics on a Per-Stock Basis

From Table 2, it is seen that the RMSE, MAE, and MAPE values are lower for ARIMA than for LSTM across all selected stocks, suggesting that ARIMA was more effective than LSTM in predicting stock prices for the underlying data.

This is likely owing to the following reasons:

- The time series data of all the selected stocks could be made stationary by differencing, making it amenable to the application of ARIMA.

- ARIMA relies on regression analysis, which is well suited to fitting curves to stationary data and applying forecasting methods.

- The forecasting was carried out on a one-step rolling basis, where the most recent data (the previous day's Adjusted Closing Price) was available for forecasting the next day's stock price. As the next day's stock price is usually very closely dependent on the most recent data (barring exceptional scenarios), ARIMA, which relies on regression-based analysis, could predict the next day's stock price quite accurately.

- Further, LSTM is ideal for capturing long-term dependencies and complex patterns inherent in sequential data. In predicting the next day's stock prices, especially using a one-step rolling forecast method, ARIMA fared better as recency of data was equally or possibly more important than long-term dependencies.
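
The differencing mentioned in the first reason can be illustrated with a short sketch. It assumes first-order differencing (d = 1) and uses made-up prices; it shows only the transformation and its inverse, not a stationarity test.

```python
import numpy as np

# Made-up daily prices, for illustration only
prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0])

# First-order differencing: day-to-day changes instead of price levels
differenced = np.diff(prices)

# The transformation is reversible: the first price plus the cumulative
# sum of the changes recovers the original series
restored = np.concatenate(([prices[0]], prices[0] + np.cumsum(differenced)))

print(differenced)
print(restored)
```

A model fitted on the differenced series forecasts the next change, and the forecast price is obtained by adding that change to the last observed price.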


5.2 Performance on Aggregate Basis 


The results of the comparison on an aggregate basis are tabulated below (output of Step 5 of Appendix A):

Metric       | ARIMA   | LSTM    | % Improvement | Remarks
Average RMSE | 2.43311 | 4.79882 | 49.298%       | ARIMA performed better
Average MAE  | 1.81522 | 3.8632  | 53.013%       | -do-
Average MAPE | 1.63783 | 3.38708 | 51.645%       | -do-

(Table 3: Comparison of ARIMA and LSTM Performance on Aggregate Basis)


5.2.1 Interpretation of Aggregate Metrics


5.2.1.1 Overall Performance

ARIMA outperformed LSTM across all metrics (RMSE, MAE, and MAPE) at an aggregate level, showing approximately 49.3%, 53.0%, and 51.6% improvement, respectively. This indicates that ARIMA was more accurate than LSTM in making predictions for the considered dataset as a whole.
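
As a worked check, the aggregate RMSE figures can be reproduced from the per-stock values in Table 2. The sketch assumes that "% Improvement" means the relative reduction in error with respect to LSTM, which is consistent with the tabulated numbers; the variable names are illustrative.

```python
# Per-stock RMSE values copied from Table 2
arima_rmse = {
    'GOOG': 2.5754, 'JPM': 2.1378, 'JNJ': 1.5489, 'WMT': 1.9617, 'TSLA': 8.2851,
    'AMZN': 3.0969, 'BP': 0.5660, 'NKE': 2.7681, 'KO': 0.6674, 'PFE': 0.7239,
}
lstm_rmse = {
    'GOOG': 4.5959, 'JPM': 4.3924, 'JNJ': 2.6498, 'WMT': 3.2097, 'TSLA': 18.7155,
    'AMZN': 5.7947, 'BP': 1.0318, 'NKE': 5.1574, 'KO': 1.2222, 'PFE': 1.2188,
}

# Simple (unweighted) averages across the ten stocks
avg_arima_rmse = sum(arima_rmse.values()) / len(arima_rmse)
avg_lstm_rmse = sum(lstm_rmse.values()) / len(lstm_rmse)

# Assumed definition: reduction in error relative to LSTM, in percent
improvement_pct = (avg_lstm_rmse - avg_arima_rmse) / avg_lstm_rmse * 100

print(round(avg_arima_rmse, 5), round(avg_lstm_rmse, 5), round(improvement_pct, 3))
```

Because the average is unweighted, high-priced and volatile stocks such as TSLA contribute disproportionately to the aggregate RMSE.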


5.2.1.2 RMSE
The lower RMSE of ARIMA indicates that ARIMA predicted stock prices closer to the actual values than LSTM did. Because RMSE squares each error before averaging, it penalizes large errors most heavily; ARIMA's smaller RMSE therefore means that its forecasts contained fewer or smaller large deviations than those of LSTM.


5.2.1.3 MAE

The lower observed MAE of ARIMA signifies that ARIMA's predictions were, on average, closer to the actual values on an absolute basis than those of LSTM. As MAE averages the magnitudes of the errors irrespective of their sign, it does not by itself indicate whether either model systematically overestimated or underestimated prices.
5.2.1.4 MAPE

ARIMA had a lower aggregate MAPE than LSTM, implying that the ARIMA model's predicted values had smaller percentage deviations from the actual values on an absolute basis. This indicates that ARIMA's predictions had, on average, smaller percentage errors than those of LSTM, making it the more reliable model for forecasting stock prices in this study.


5.2.1.5 Overall Assessment 
The better observed performance of ARIMA compared to LSTM on an aggregate basis is likely owing to the same reasons as explained in Section 5.1.1.


5.3 Visualization 


Plots of predicted and actual test data values have been generated for all considered stocks for visualization and analysis (outputs of Steps 3 and 4 of Appendix A). Plots for Alphabet Inc. (GOOG) are given below for discussion.

Figure 13: Actual vs Predicted Stock Prices GOOG: ARIMA (output of Step 3 of Appendix A)

Figure 14: Actual vs Predicted Stock Prices GOOG: LSTM (output of Step 4 of Appendix A)


5.3.1 Interpretation of Plots 

The above plots of ‘Actual versus Predicted Stock Prices’ using ARIMA and LSTM demonstrate the ability of both models to predict stock prices to varying degrees of accuracy. It can also be seen that the prices predicted by ARIMA followed the actual stock price more closely than those predicted by LSTM. This is consistent with the observed performance metrics for GOOG using ARIMA and LSTM respectively: RMSE (2.5754 vs 4.5959), MAPE (2.0113 vs 3.9893) and MAE (1.9010 vs 3.7554).


Further, it is noteworthy that predictions using ARIMA tended to follow the actual prices more immediately than those using LSTM. This behaviour is attributable primarily to the employment of the one-step rolling forecast method. LSTM exhibited a slower response to sharp changes in stock prices. This behaviour is likely attributable to several factors inherent in LSTM’s model architecture, including its ability to capture long-term dependencies and the incorporation of such dependencies into its predictions. Consequently, LSTM may have moderated the influence of recently observed data, resulting in a slower adjustment to changes in stock prices.


5.3.2 Plots of ‘Actual versus Predicted Prices (using ARIMA and LSTM)’ of Other Stocks


Plots of other stocks obtained as output from Steps 3 and 4 of Appendix A are listed below. These show similar patterns consistent with the above analysis.

Figure 15: Actual vs Predicted Stock Prices JPM: ARIMA (output of Step 3 of Appendix A)
Figure 16: Actual vs Predicted Stock Prices JPM: LSTM (output of Step 4 of Appendix A)
Figure 17: Actual vs Predicted Stock Prices JNJ: ARIMA (output of Step 3 of Appendix A)
Figure 18: Actual vs Predicted Stock Prices JNJ: LSTM (output of Step 4 of Appendix A)
Figure 19: Actual vs Predicted Stock Prices WMT: ARIMA (output of Step 3 of Appendix A)
Figure 20: Actual vs Predicted Stock Prices WMT: LSTM (output of Step 4 of Appendix A)
Figure 21: Actual vs Predicted Stock Prices TSLA: ARIMA (output of Step 3 of Appendix A)
Figure 22: Actual vs Predicted Stock Prices TSLA: LSTM (output of Step 4 of Appendix A)
Figure 23: Actual vs Predicted Stock Prices AMZN: ARIMA (output of Step 3 of Appendix A)
Figure 24: Actual vs Predicted Stock Prices AMZN: LSTM (output of Step 4 of Appendix A)
Figure 25: Actual vs Predicted Stock Prices BP: ARIMA (output of Step 3 of Appendix A)
Figure 26: Actual vs Predicted Stock Prices BP: LSTM (output of Step 4 of Appendix A)
Figure 27: Actual vs Predicted Stock Prices NKE: ARIMA (output of Step 3 of Appendix A)
Figure 28: Actual vs Predicted Stock Prices NKE: LSTM (output of Step 4 of Appendix A)
Figure 29: Actual vs Predicted Stock Prices KO: ARIMA (output of Step 3 of Appendix A)
Figure 30: Actual vs Predicted Stock Prices KO: LSTM (output of Step 4 of Appendix A)
Figure 31: Actual vs Predicted Stock Prices PFE: ARIMA (output of Step 3 of Appendix A)
Figure 32: Actual vs Predicted Stock Prices PFE: LSTM (output of Step 4 of Appendix A)


6 Conclusion 


It can be concluded, based on empirical evidence, that ARIMA can predict stock prices more accurately than LSTM.


However, the accuracy of stock price predictions is sensitive to various factors, such as the underlying stock data, the values assigned to ARIMA parameters, the chosen LSTM architecture, and the tuning of its hyperparameters. Further, the methodology used for predicting stock prices, the frequency of updating the model with the observed values and the forecast period also have a significant impact on the accuracy of predictions.


Therefore, it would be incorrect to generalize the above conclusion and extend it to all situations. This possibly explains the continued exploration of this topic by researchers, with each investigation arriving at different findings.


7 Limitations and Future Work 


In this study, predictions were made solely based on historical data of a single variable, namely, the adjusted close price. However, it is common knowledge that stock prices are influenced by various factors, such as macroeconomic and microeconomic data, the company's financial performance, government policies, market sentiment and natural and manmade disasters. Therefore, alternative approaches employing models which can incorporate multiple variables, such as multivariate LSTM and hybrid models, could yield more accurate results. Exploration of such models was beyond the current scope and has been marked for future research.


In this extended essay, a one-step rolling forecast method was used, with the model predicting the next day's stock price and thereafter being updated with the observed value prior to making the next prediction. While this approach served to standardize the comparison of ARIMA and LSTM, practical considerations may necessitate making stock price predictions on a weekly, monthly, yearly, or any other time period. The code developed for the study can be adapted with minor modifications to compare the performance of ARIMA and LSTM for such arbitrary time periods. Such a study can add more insights into understanding the overall performances of ARIMA and LSTM.


Lastly, as mentioned in various sections of the paper, tuning parameters and hyperparameters and choosing an optimal model have a significant impact on the experimental outcomes. A more thorough study to understand the impact on the accuracy of predictions by varying hyperparameters is needed. This needs a deeper understanding of the domain and further study; thus, the same has been reserved for future exploration.
9 Appendix A: Performance Evaluation of ARIMA and LSTM in Stock Price Prediction


This Appendix contains the code for carrying out the performance evaluation of ARIMA and LSTM in predicting stock prices. Steps involved in the analysis:


1. Importing the required libraries and setting the configuration parameters.

2. Importing the dataset from Yahoo Finance using the yfinance library and saving it to a CSV file for later use.

3. Performing rolling forecast ARIMA modeling on the dataset for each stock. This involves:
   - Data preprocessing, involving:
     - Loading the CSV file into a Pandas DataFrame.
     - Checking for missing values and filling them with the previous day's values.
     - Sorting the data in ascending order of date.
     - Converting the index to a datetime object.
     - Filtering the data to include only the Date and Adj Close columns which will be used for analysis.
     - Converting the Adj Close price to a float32 type, as it speeds up the computation and is the default type for the auto_arima() function.
   - Splitting the data into train and test sets, and visualizing the train and test sets.
   - Building the ARIMA model using the auto_arima() function with the necessary parameters for optimization.
   - Predicting the stock price using the ARIMA model with a one-step rolling forecast, one day at a time, for the test set.
   - Evaluating the model performance using the root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) metrics and the time taken for model training and prediction.
   - Printing the metrics and the time taken for model training and prediction.
   - Visualizing the predictions by plotting the predicted and actual stock prices for the test set along with the metrics.
   - Visualizing the residuals by plotting the residuals and the density plot of the residuals.
   - Saving the predictions to a CSV file for further analysis.
   - Saving the performance metrics to a CSV file for further analysis.
   - Saving the plots to PNG files for further analysis.
   - Calculating the average RMSE, MAE, and MAPE across the test sets of all stocks. These averages are used to compare the performance of ARIMA with that of LSTM in the next steps.
   - Saving the average performance metrics to a CSV file for further analysis.

4. Similarly, performing a rolling forecast using LSTM and saving the predictions and performance metrics to CSV files for further analysis. However, in the case of LSTM, the data is scaled using MinMaxScaler and reshaped to a 3D array before training the model. Further, the keras-tuner library is used to tune the hyperparameters of the LSTM model (similar to the auto_arima() function), and the best hyperparameters are used to train the model and predict the stock price. The hyperparameters tuned are:
   - Number of LSTM layers
   - Number of LSTM units
   - Number of epochs
   - Batch size
   - Dropout rate

5. Comparing the performance of the ARIMA and LSTM models using the average RMSE, MAE, and MAPE metrics.
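
The scaling and reshaping mentioned in Step 4 can be sketched as below. This is an illustration under stated assumptions, not the study's LSTM code: the helpers `min_max_scale` and `make_windows` are hypothetical names, the manual scaling stands in for scikit-learn's MinMaxScaler, and a look-back of 3 is used on made-up prices for brevity (the configuration in Step 1 sets look_back to 10).

```python
import numpy as np


def min_max_scale(values):
    """Scale values linearly to the range [0, 1] (stand-in for MinMaxScaler)."""
    lowest, highest = values.min(), values.max()
    return (values - lowest) / (highest - lowest)


def make_windows(series, look_back):
    """Build (samples, time steps, features) inputs and next-step targets."""
    inputs = np.array([series[i:i + look_back]
                       for i in range(len(series) - look_back)])
    targets = series[look_back:]
    # Keras LSTM layers expect a 3D array; there is one feature per time step
    return inputs.reshape(-1, look_back, 1), targets


# Made-up prices, for illustration only
prices = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 15.0])
scaled = min_max_scale(prices)
X, y = make_windows(scaled, look_back=3)

print(X.shape, y.shape)
```

Each input sample holds the previous look_back scaled prices and its target is the price that immediately follows, which matches the one-day-ahead setting used for ARIMA.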


9.1 Step 1: Importing the required libraries and setting the configuration parameters.


# Step 1: Import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

from sklearn.metrics import (mean_squared_error, mean_absolute_error, max_error, r2_score,
                             median_absolute_error, mean_absolute_percentage_error)
from sklearn.preprocessing import MinMaxScaler

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.callbacks import EarlyStopping

from keras_tuner.tuners import GridSearch

from math import sqrt
import warnings
# suppress warnings
warnings.filterwarnings('ignore')

from matplotlib.ticker import MaxNLocator
from pmdarima.arima import auto_arima
import time
from tabulate import tabulate

# set styles for plots
# style options - white, dark, whitegrid, darkgrid, ticks
# palette options - muted, deep, pastel, bright, dark, colorblind
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.2)

# changes the scale of the plot. other options include: paper, notebook, talk, poster.
# paper is suitable for saving as pdf or for reports
sns.set_context('paper')


# Step 1A: Define variables for configuration

# the directory where the data will be saved
data_dir = 'data'
# the directory where the results will be saved
results_dir = 'results'
# the directory where the plots will be saved
plots_dir = 'plots'
# the list of tickers to be used for analysis
tickers = ['GOOG', 'JPM', 'JNJ', 'WMT', 'TSLA', 'AMZN', 'BP', 'NKE', 'KO', 'PFE']

# the start date of the data to be downloaded
start_date = '2018-01-01'
# the end date of the data to be downloaded
end_date = '2023-01-01'
# the column name to use for analysis
column_name = 'Adj Close'
# the ratio to split the data into train and test sets
split_ratio = 0.95

# LSTM variables
# the number of previous time steps to use as input variables to predict the next time period
look_back = 10
# number of samples per training batch
batch_size = 1
# number of epochs to train the model
nb_epoch = 10
# maximum number of neurons to use
neurons = 4
# patience for early stopping
early_stopping_patience = 3
# number of days to predict
days_to_predict = 1

# check if data directory exists; else create it so that data can be saved there
print('Checking if data directory exists...')
if not os.path.exists(data_dir):
    print('Data directory does not exist. Creating data directory...')
    os.makedirs(data_dir)
    print('Data directory created.')
else:
    print('Data directory exists.')

# check if results directory exists; else create it so that results can be saved there
print('Checking if results directory exists...')
if not os.path.exists(results_dir):
    print('Results directory does not exist. Creating results directory...')
    os.makedirs(results_dir)
    print('Results directory created.')
else:
    print('Results directory exists.')

# check if plots directory exists; else create it so that plots can be saved there
print('Checking if plots directory exists...')
if not os.path.exists(plots_dir):
    print('Plots directory does not exist. Creating plots directory...')
    os.makedirs(plots_dir)
    print('Plots directory created.')
else:
    print('Plots directory exists.')

9.2 Step 2: Importing the dataset from Yahoo Finance using yfinance library and saving it to a CSV file for later use.


# function to get data from yfinance and save as CSV
def get_ticker_data_and_save_as_csv(ticker, start_date, end_date, data_dir):
    """
    A function to get data from yfinance and save as CSV.

    Parameters
    ----------
    ticker : str
        The ticker symbol of the stock.
    start_date : str
        The start date of the data to be downloaded.
    end_date : str
        The end date of the data to be downloaded.
    data_dir : str
        The directory where the data will be saved.

    Returns
    -------
    data : DataFrame
        The data downloaded from yfinance.
    """
    # Validate inputs
    if not all([ticker, start_date, end_date, data_dir]):
        raise ValueError(
            'All input parameters (ticker, start_date, end_date, data_dir) must be provided.')

    try:
        # Get data from yfinance
        print(f'Getting data for {ticker}...')
        data = yf.download(ticker, start=start_date, end=end_date)
        print('Done.')

        # Sanitize data
        print('Sanitising data...')
        data.index = pd.to_datetime(data.index)
        data = data.sort_index()

        # Check for missing values
        print('Checking for missing values...')
        if data.isnull().values.any():
            print('Data contains missing values. Using ffill method to fill missing values...')
            # forward-fill: carry the previous day's values into the gaps
            data = data.ffill()

        # Save data to data_dir
        output_file = os.path.join(data_dir, f'{ticker}.csv')
        print(f'Saving data to {output_file}...')
        data.to_csv(output_file)
        print('Done.')
    except ValueError:
        raise ValueError(f'No data found for {ticker}.')

    return data


# Function to load data from a CSV file
def load_data_from_csv(ticker, data_dir):
    """
    A function to load data from a CSV file.

    Parameters
    ----------
    ticker : str
        The ticker symbol of the stock.
    data_dir : str
        The directory where the data is saved.

    Returns
    -------
    data : DataFrame
        The data loaded from the CSV file.
    """
    # Validate inputs
    if not all([ticker, data_dir]):
        raise ValueError(
            'All input parameters (ticker, data_dir) must be provided.')

    # check if data directory exists; else raise error
    if not os.path.exists(data_dir):
        raise ValueError(f'Data directory {data_dir} does not exist.')
    # Load data from CSV file
    input_file = os.path.join(data_dir, f'{ticker}.csv')
    # check if file exists; else raise error
    if not os.path.exists(input_file):
        raise ValueError(f'File {input_file} does not exist.')

    print(f'Loading data from {input_file}...')
    data = pd.read_csv(input_file, index_col=0)
    print('Done.')

    return data


# We iterate through the list of tickers and get data for each ticker
for ticker in tickers:
    try:
        data_from_yahoo = get_ticker_data_and_save_as_csv(ticker,
                                                          start_date,
                                                          end_date,
                                                          data_dir)
        # Use the 'data' DataFrame as needed
        print("Obtained data for " + ticker + " from yfinance and saved to " + data_dir + " as " + ticker + ".csv")
        data = load_data_from_csv(ticker, data_dir)
    except ValueError as e:
        # Handle the error accordingly
        print(f"Error occurred: {e}")
9.3 Step 3: Perform rolling forecast ARIMA modeling on the dataset for each stock.


# define functions

# Function to select a column from a dataframe
def get_column_data(df, column_name):
    """
    A function to select a column from a DataFrame.

    Parameters
    ----------
    df : DataFrame
        The DataFrame to select the column from.
    column_name : str
        The name of the column to select.

    Returns
    -------
    values : Series
        The values of the column.
    """
    # Validate inputs
    if df is None:
        raise ValueError('df is required.')
    if not isinstance(df, pd.DataFrame):
        raise ValueError('df should be a pandas DataFrame.')
    if not isinstance(column_name, str):
        raise ValueError('column_name should be a string.')

    # Check if the column exists in the DataFrame
    if column_name not in df.columns:
        raise ValueError(f'Column "{column_name}" does not exist in the DataFrame.')

    # Retrieve the column data
    values = df[column_name]
    return values


# plot original data series
def plot_original_data_series(original_data_series, title_text,
                              column_name='Adj Close Price',
                              index_name='Date',
                              save_path=None,
                              file_name='original_data_series.png'):
    """
    A function to plot the original data series.

    Parameters
    ----------
    original_data_series : Series
        The original data series.
    title_text : str
        The title of the plot.
    column_name : str
        The label for the y-axis.

    Returns
    -------
    None
    """
    plt.figure(figsize=(10, 6), dpi=100)
    plt.title(title_text)
    plt.xlabel(index_name)
    plt.ylabel(column_name)
    plt.plot(original_data_series, label='Original Data Series', color='blue')
    plt.xticks(rotation=45)
    # Get the current axes
    ax = plt.gca()
    # Automatically set the number of x-axis ticks
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
    #ax.xaxis.set_major_locator(AutoLocator())
    #plt.xticks(np.arange(0, len(original_data_series), len(original_data_series)/20))
    plt.grid(visible=True, linestyle='dotted', linewidth=0.5, axis='both', which='major', color='grey')
    plt.tight_layout()
    plt.legend(loc='best')
    # if save_path is provided, save the plot to the path
    if save_path is not None:
        # add the .png extension if the file name does not already end in png, jpg or jpeg
        if not file_name.endswith(('.png', '.jpg', '.jpeg')):
            file_name = f'{file_name}.png'
        # save plot to the path
        plt.savefig(os.path.join(save_path, file_name))

    plt.show(block=False)
    # close plot
    plt.close()


# Function to create metrics dataframe
def create_metrics_dataframe(rows):
    # Create an empty DataFrame with columns 'key', 'text', 'value' for holding metrics
    metrics = pd.DataFrame(columns=['key', 'text', 'value'])
    # Concatenate the rows to the metrics dataframe
    for row in rows:
        # Convert the row to a DataFrame
        row_df = pd.DataFrame(row, index=[0])
        # Concatenate the row to the metrics dataframe
        metrics = pd.concat([metrics, row_df], ignore_index=True)

    return metrics


def get_predictions_and_metrics_using_arima(actual_data_indexed, train_size):
    ############################################################################
    # split the data into train and test sets
    train, test = actual_data_indexed[0:train_size], actual_data_indexed[train_size:len(actual_data_indexed)]
    train_values = train.values
    test_values = test.values
    # create a history list initially containing the training data set.
    # we will use this for the initial prediction
    # and with each prediction, we will append the actual value to the history list
    # and use it for the next prediction as input to the model
    history = [x for x in train_values]
    # create a list to store the test predictions
    test_predictions = list()
    # accumulator for the total model selection and fitting time
    arima_model_and_fit_time_in_ms = 0
    # accumulator for the total prediction time
    arima_prediction_time_in_ms = 0

    # walk-forward validation for each time step in the test data set
    i = 0
    for observed_value in test_values:
        # get time used for model and fit
        print('ARIMA: Predicting for : ' + str(i) + '/' + str(len(test_values)))
        i += 1
        start_time_arima_model_and_fit = time.time()
        # define model configuration. use the history list as the input to the model
        model = auto_arima(history,
                           start_p=1, start_q=1, d=1,
                           max_p=5, max_q=2, max_d=2,
                           D=1, max_D=2, m=1,
                           seasonal=True,
                           trace=False,
                           error_action='ignore',
                           suppress_warnings=True, stepwise=True)
        # fit model
        arima_model = model.fit(history)
        end_time_arima_model_and_fit = time.time()
        arima_model_and_fit_time_in_ms += (end_time_arima_model_and_fit - start_time_arima_model_and_fit) * 1000
        # print the summary of the model to get the model parameters
        summary = arima_model.summary()
        print('ARIMA Model Summary: ')
        print(summary)
        # get prediction for the next day
        start_time_predict = time.time()
        # Predict next value
        yhat, conf_int = arima_model.predict(n_periods=1, return_conf_int=True)
        end_time_predict = time.time()
        arima_prediction_time_in_ms += (end_time_predict - start_time_predict) * 1000  # convert to ms
        # store the prediction
        test_predictions.append(yhat)
        # add the actual value to the history object for the next iteration to train the model
        history.append(observed_value)

    # get time used for fit and predict
    arima_total_time_for_model_fit_and_predict_in_ms = arima_model_and_fit_time_in_ms + arima_prediction_time_in_ms
    # recovery is not needed since we are using auto_arima and provided the original series
    # as input to the model
    # Convert test_predictions list into a Pandas Series
    flattened_test_predictions = [item for sublist in test_predictions for item in sublist]
    test_predictions_series = pd.Series(flattened_test_predictions, index=test.index)
    # calculate accuracy metrics
    rmse = sqrt(mean_squared_error(test_values, test_predictions))
    mae = mean_absolute_error(test_values, test_predictions)
    mape = mean_absolute_percentage_error(test_values, test_predictions) * 100
    max_error_value = max_error(test_values, test_predictions)
    r2 = r2_score(test_values, test_predictions)
    median_absolute_error_value = median_absolute_error(test_values, test_predictions)

    rows = [
        {'key': 'rmse', 'text': 'RMSE', 'value': rmse},
        {'key': 'mape', 'text': 'MAPE', 'value': mape},
        {'key': 'r2', 'text': 'R2', 'value': r2},
        {'key': 'max_error_value', 'text': 'Max Error', 'value': max_error_value},
        {'key': 'mean_absolute_error', 'text': 'Mean Absolute Error', 'value': mae},
        {'key': 'median_absolute_error_value', 'text': 'Median Absolute Error',
         'value': median_absolute_error_value},
        {'key': 'arima_total_time_for_model_fit_and_predict_in_ms', 'text': 'Total Time for Fit and Predict',
         'value': arima_total_time_for_model_fit_and_predict_in_ms},
        {'key': 'arima_model_and_fit_time_in_ms', 'text': 'Total Time for Model and Fit',
         'value': arima_model_and_fit_time_in_ms},
        {'key': 'arima_prediction_time_in_ms', 'text': 'Total Time for Prediction',
         'value': arima_prediction_time_in_ms}
    ]
    # create a dataframe with columns - key, text, value for holding metrics
    metrics_df = create_metrics_dataframe(rows)

    # return the model, test_predictions_series and metrics_df
    return arima_model, test_predictions_series, metrics_df


# Function to plot the actual and predicted values
def plot_actual_and_predicted_values(actual_values,
                                     predicted_values,
                                     title_text,
                                     index_name='Date',
                                     column_name='Price',
                                     save_path=None,
                                     file_name='actual_vs_predicted.png'):
    """
    A function to plot the actual and predicted values.
    """
    # plot original data series test and test predictions series
    plt.figure(figsize=(10, 6), dpi=100)
    plt.title(title_text)
    plt.xlabel(index_name)
    plt.ylabel(column_name)
    plt.plot(actual_values, label='Actual', color='blue')
    plt.plot(predicted_values, label='Predictions', color='orange')
    plt.xticks(rotation=45)
    # Get the current axes
    ax = plt.gca()
    # Automatically set the number of x-axis ticks
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
    # show legend
    plt.legend(loc='best')

    # if save_path is provided, save the plot to the path
    if save_path is not None:
        # add the .png extension if the file name does not already end in png, jpg or jpeg
        if not file_name.endswith(('.png', '.jpg', '.jpeg')):
            file_name = f'{file_name}.png'
        # save plot to the path
        plt.savefig(os.path.join(save_path, file_name))

    # show plot
    plt.show()
    # close plot
    plt.close()


# create a map to hold the metrics for each ticker
arima_metrics_map = {}

for ticker in tickers:
    try:
        data = load_data_from_csv(ticker, data_dir)
        # Use the 'data' DataFrame as needed
        print("Loaded data for " + ticker + " from " + data_dir + " as " + ticker + ".csv")
        adj_close_data = get_column_data(data, column_name)
        # let's call the adj_close_data 'original_data_series' so that there is no confusion
        original_data_series = adj_close_data
        # plot the original data series for visual inspection
        plot_original_data_series(original_data_series,
                                  f'{ticker}: {column_name} Prices: (Original Data Series)',
                                  'Original Data Series',
                                  'Date',
                                  plots_dir,
                                  f'{ticker}_original_data_series.png')
        # split the data into train and test data
        train_size = int(len(original_data_series) * split_ratio)
        original_data_series_train = original_data_series[:train_size]
        original_data_series_test = original_data_series[train_size:]
        # use get_predictions_and_metrics_using_arima
        original_data_arima_model, original_data_test_predictions_series, arima_metrics_df = \
            get_predictions_and_metrics_using_arima(original_data_series, train_size)
        # add the metrics dataframe to the map
        arima_metrics_map[ticker] = arima_metrics_df
        # save the metrics dataframe to a csv file
        arima_metrics_df.to_csv(os.path.join(results_dir, f'{ticker}_arima_metrics.csv'))
        # save test predictions series to a csv file
        original_data_test_predictions_series.to_csv(
            os.path.join(results_dir, f'{ticker}_arima_original_data_test_predictions_series.csv'))

        # plot the full original data series and the test predictions series
        plt.figure(figsize=(10, 6), dpi=100)
        plt.title(f'{ticker}: {column_name} Prices: ARIMA - Observed Vs Predictions):ARIMA')
        plt.xlabel('Date')
        plt.ylabel(column_name)
        plt.plot(original_data_series, label='Observed(Actual)', color='blue')
        plt.plot(original_data_test_predictions_series, label='Predictions', color='orange')
        plt.xticks(rotation=45)
        # Get the current axes
        ax = plt.gca()
        # Automatically set the number of x-axis ticks
        ax.xaxis.set_major_locator(MaxNLocator(integer=True))
        # show legend
        plt.legend(loc='best')
        # show plot
        plt.show()

        # plot original data series test and test predictions series
        plot_actual_and_predicted_values(original_data_series_test,
                                         original_data_test_predictions_series,
                                         f'{ticker}: {column_name} Prices: ARIMA - Observed Vs Predictions',
                                         index_name='Date',
                                         column_name=column_name,
                                         save_path=plots_dir,
                                         file_name=f'{ticker}_arima_original_data_test_predictions_series.png')
    except ValueError as e:
        print(f"Error occurred: {e}")


# select performance metrics and time metrics from arima_metrics_map and add them to the
# arima_select_performance_metrics and arima_time_metrics maps
arima_select_performance_metrics = {}
arima_time_metrics = {}
arima_mean_performance_metrics = {}
arima_mean_time_metrics = {}
selected_performance_metrics = ['rmse', 'mape', 'mean_absolute_error']
selected_time_metrics = ['arima_model_and_fit_time_in_ms',
                         'arima_prediction_time_in_ms',
                         'arima_total_time_for_model_fit_and_predict_in_ms']


# average each selected metric over all tickers
def calculate_mean_metrics(metrics_dict, tickers, selected_metrics):
    mean_metrics = {}
    for metric in selected_metrics:
        mean_metrics[metric] = sum(
            metrics_dict[ticker][metrics_dict[ticker]['key'] == metric]['value'].values[0]
            for ticker in tickers) / len(tickers)
    return mean_metrics
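
The averaging step above can be checked on a small example. The sketch below is a minimal illustration, assuming each ticker's metrics are held in a DataFrame with 'key' and 'value' columns as in the listing. The helper name mean_of_metric, the tickers 'AAA' and 'BBB', and all numbers are hypothetical and are not results from this essay.

```python
import pandas as pd

# Hypothetical metrics for two made-up tickers (illustrative values only).
metrics_by_ticker = {
    'AAA': pd.DataFrame({'key': ['rmse', 'mape'], 'value': [2.0, 1.0]}),
    'BBB': pd.DataFrame({'key': ['rmse', 'mape'], 'value': [4.0, 3.0]}),
}


def mean_of_metric(metrics_dict, tickers, metric):
    # For each ticker, pick the row whose 'key' equals the metric,
    # then average the corresponding 'value' entries across tickers.
    values = [
        metrics_dict[t].loc[metrics_dict[t]['key'] == metric, 'value'].iloc[0]
        for t in tickers
    ]
    return sum(values) / len(values)


tickers_demo = ['AAA', 'BBB']
mean_rmse = mean_of_metric(metrics_by_ticker, tickers_demo, 'rmse')
mean_mape = mean_of_metric(metrics_by_ticker, tickers_demo, 'mape')
print(mean_rmse, mean_mape)
```

Note that an unweighted mean across tickers treats every stock equally, so scale-dependent metrics such as RMSE are dominated by higher-priced stocks, whereas MAPE is scale-free.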


for ticker in tickers:
    # add the selected performance metrics to arima_select_performance_metrics
    arima_select_performance_metrics[ticker] = arima_metrics_map[ticker][
        arima_metrics_map[ticker]['key'].isin(selected_performance_metrics)]
    # add the selected time metrics to arima_time_metrics
    arima_time_metrics[ticker] = arima_metrics_map[ticker][
        arima_metrics_map[ticker]['key'].isin(selected_time_metrics)]

# Calculate mean performance metrics
arima_mean_performance_metrics = calculate_mean_metrics(arima_select_performance_metrics,
                                                        tickers,
                                                        selected_performance_metrics)

# Calculate mean time metrics
arima_mean_time_metrics = calculate_mean_metrics(arima_time_metrics, tickers, selected_time_metrics)


# print arima_select_performance_metrics as a table using tabulate with columns -
# Ticker, Metric, Value (RMSE, MAPE, Mean Absolute Error), and title as 'Accuracy Metrics for ARIMA'
print('\nAccuracy Metrics for ARIMA:\n')

all_data = []
for ticker in tickers:
    ticker_data = arima_select_performance_metrics.get(ticker)
    if ticker_data is not None and not ticker_data.empty:
        all_data.append([ticker, 'RMSE',
                         ticker_data[ticker_data['key'] == 'rmse']['value'].values[0]])
        all_data.append(['', 'MAPE',
                         ticker_data[ticker_data['key'] == 'mape']['value'].values[0]])
        all_data.append(['', 'Mean Absolute Error',
                         ticker_data[ticker_data['key'] == 'mean_absolute_error']['value'].values[0]])
    else:
        print(f'No data for {ticker} in arima_select_performance_metrics\n')

headers = ['Ticker', 'Metric', 'Value']
merged_table = pd.DataFrame(all_data, columns=headers)
print(tabulate(merged_table, headers=headers, tablefmt='orgtbl', showindex=False,
               floatfmt=".4f", numalign="right"))
print('\n')


# print arima_mean_performance_metrics as a table
# with rows - RMSE, MAPE, Mean Absolute Error, and title as 'Average Accuracy Metrics for ARIMA'
print('\nAverage Accuracy Metrics for ARIMA:\n')
mean_rmse_arima = arima_mean_performance_metrics['rmse']
mean_mape_arima = arima_mean_performance_metrics['mape']
mean_mean_absolute_error_arima = arima_mean_performance_metrics['mean_absolute_error']

table_data = [['RMSE', mean_rmse_arima],
              ['MAPE', mean_mape_arima],
              ['Mean Absolute Error', mean_mean_absolute_error_arima]]
headers = ['Metric', 'Value']
print(tabulate(table_data, headers=headers, tablefmt='orgtbl'))
print('\n')


# print arima_time_metrics as a table using tabulate with rows - Model Fit Time(ARIMA),
# Prediction Time(ARIMA), Total Time for Fit and Predict(ARIMA) per ticker,
# and title as 'Time Metrics for ARIMA'
print('\nTime Metrics for ARIMA:\n')

all_data = []
for ticker in tickers:
    model_fit_time_arima = arima_time_metrics[ticker][
        arima_time_metrics[ticker]['key'] == 'arima_model_and_fit_time_in_ms']['value'].values[0]
    prediction_time_arima = arima_time_metrics[ticker][
        arima_time_metrics[ticker]['key'] == 'arima_prediction_time_in_ms']['value'].values[0]
    total_time_for_model_fit_and_predict_arima = arima_time_metrics[ticker][
        arima_time_metrics[ticker]['key']
        == 'arima_total_time_for_model_fit_and_predict_in_ms']['value'].values[0]

    all_data.append([ticker, 'Model Fit Time(ARIMA)', model_fit_time_arima])
    all_data.append(['', 'Prediction Time(ARIMA)', prediction_time_arima])
    all_data.append(['', 'Total Time for Fit and Predict(ARIMA)',
                     total_time_for_model_fit_and_predict_arima])

headers = ['Ticker', 'Metric', 'Value']
merged_table = pd.DataFrame(all_data, columns=headers)
# print merged_table using tabulate
print(tabulate(merged_table,
               headers=headers,
               tablefmt='orgtbl',
               showindex=False,
               floatfmt=".4f",
               numalign="right"))
print('\n')


# print arima_mean_time_metrics as a table using tabulate with rows - Model Fit Time(ARIMA),
# Prediction Time(ARIMA), Total Time for Fit and Predict(ARIMA),
# and title as 'Average Time Metrics for ARIMA'
print('\nAverage Time Metrics for ARIMA:\n')
mean_model_fit_time_arima = arima_mean_time_metrics['arima_model_and_fit_time_in_ms']
mean_prediction_time_arima = arima_mean_time_metrics['arima_prediction_time_in_ms']
mean_total_time_for_model_fit_and_predict_arima = arima_mean_time_metrics[
    'arima_total_time_for_model_fit_and_predict_in_ms']

table_data = [['Model Fit Time(ARIMA)',
               mean_model_fit_time_arima],
              ['Prediction Time(ARIMA)',
               mean_prediction_time_arima],
              ['Total Time for Fit and Predict(ARIMA)',
               mean_total_time_for_model_fit_and_predict_arima]]
headers = ['Metric', 'Value']
# print the table using tabulate
print(tabulate(table_data,
               headers=headers,
               tablefmt='orgtbl',
               showindex=False,
               floatfmt=".4f",
               numalign="right"))
print('\n')


9.4 Step 4: Perform rolling forecast using LSTM and save the predictions and performance metrics to CSV files for further analysis.


# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    # dataset is a numpy array that contains the stock prices
    # look_back is the number of previous time steps to use as input variables
    # to predict the next time period
    # dataX is the input variable while dataY is the output variable
    dataX, dataY = [], []
    # with look_back = 20, dataX contains the previous 20 days of stock prices
    # while dataY contains the stock price for the next day
    # if len(dataset) is 100 and look_back is 20, the loop runs for i = 0 to 78 (79 iterations)
    for i in range(len(dataset) - look_back - 1):
        # a will contain the stock prices from 0 to 19 in the first iteration,
        # 1 to 20 in the second iteration and so on
        a = dataset[i:(i + look_back), 0]
        # append the 20 stock prices to dataX at each iteration,
        # so dataX will contain 79 arrays of 20 stock prices each, shifted by 1 stock price
        # at each iteration
        dataX.append(a)
        # append the stock price for the 21st day to dataY at each iteration,
        # so dataY will contain 79 stock prices, shifted by 1 stock price at each iteration
        dataY.append(dataset[i + look_back, 0])
    # return dataX and dataY as numpy arrays
    return np.array(dataX), np.array(dataY)
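
The windowing performed by create_dataset can be illustrated on a toy series. The following is a minimal sketch that mirrors the loop bound of the listing; the function name make_windows and the toy data are hypothetical and used only for illustration.

```python
import numpy as np


def make_windows(series, look_back):
    # Build (input window, next value) pairs from a column vector of prices.
    # The loop bound mirrors the listing: len(series) - look_back - 1 samples.
    xs, ys = [], []
    for i in range(len(series) - look_back - 1):
        xs.append(series[i:i + look_back, 0])
        ys.append(series[i + look_back, 0])
    return np.array(xs), np.array(ys)


prices = np.arange(10, dtype=float).reshape(-1, 1)  # toy series 0.0 .. 9.0
X, y = make_windows(prices, look_back=3)
print(X.shape, y.shape)  # (6, 3) (6,)
print(X[0], y[0])        # [0. 1. 2.] 3.0
print(X[-1], y[-1])      # [5. 6. 7.] 8.0
```

With this loop bound, the last observation (9.0 in the toy series) is never used as a target. A bound of len(series) - look_back would include it and yield one more sample.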


# Function to build lstm model
def build_lstm_model(hp):
    model = Sequential()
    model.add(LSTM(units=hp.Int('units', min_value=1, max_value=50, step=1),
                   batch_input_shape=(batch_size, look_back, 1),
                   stateful=True))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model


# get predictions and metrics using lstm with rolling window
def get_predictions_and_metrics_using_lstm(actual_data_indexed, train_size):
    org_data_set = actual_data_indexed
    data_series = org_data_set.values
    # normalize the dataset
    scaler = MinMaxScaler(feature_range=(0, 1))
    data_series = scaler.fit_transform(data_series.reshape(-1, 1))
    # split into train and test sets
    train_size = int(len(data_series) * split_ratio)
    test_size = len(data_series) - train_size
    train, test = data_series[0:train_size, :], data_series[train_size:len(data_series), :]
    train_indexed = org_data_set[0:train_size]
    test_indexed = org_data_set[train_size:len(org_data_set)]
    # reshape into X=t and Y=t+1
    trainX, trainY = create_dataset(train, look_back)
    testX, testY = create_dataset(test, look_back)
    # reshape input to be [samples, time steps, features]
    # trainX.shape[0] is the number of rows in trainX, trainX.shape[1] is the number of
    # columns in trainX, 1 is the number of features
    trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
    # testX.shape[0] is the number of rows in testX, testX.shape[1] is the number of
    # columns in testX, 1 is the number of features
    testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))

    # create a stateful LSTM network
    lstm_model_and_fit_time_in_ms = 0
    tuner = GridSearch(
        build_lstm_model,
        objective='val_loss',
        max_trials=5)  # Set the total number of trials
    print('finding LSTM best model...')

    # start lstm model and fit time
    start_lstm_model_and_fit_time = time.time()
    # search for best model
    tuner.search(trainX,
                 trainY,
                 epochs=5,
                 batch_size=batch_size,
                 validation_data=(testX, testY))

    # get the best model
    best_model = tuner.get_best_models(1)[0]
    print(best_model.summary())
    # define early stopping callback
    # Here, monitor is the quantity to be monitored,
    # patience is the number of epochs with no improvement after which training will be stopped,
    # verbose is the verbosity mode,
    # verbose=0 is silent, verbose=1 is progress bar,
    # verbose=2 is one line per epoch
    early_stopping = EarlyStopping(monitor='loss',
                                   patience=early_stopping_patience,
                                   verbose=1)
    # fit the best model
    best_model.fit(trainX, trainY,
                   epochs=nb_epoch,
                   batch_size=batch_size,
                   verbose=2,
                   shuffle=False,
                   callbacks=[early_stopping])
    # end lstm model and fit time
    end_lstm_model_and_fit_time = time.time()
    # get time used for model search and fit, in milliseconds
    lstm_model_and_fit_time_in_ms += (end_lstm_model_and_fit_time
                                      - start_lstm_model_and_fit_time) * 1000


    # implement walk forward validation and get predictions
    lstm_predictions = list()
    lstm_prediction_time_in_ms = 0
    # walk-forward validation for each time step in the test data set
    i = 0
    for obs in test:
        # X, y = testX[i, 0, :], testY[i]
        # reshape input to be [samples, time steps, features]
        i = i + 1
        print(" LSTM: Predicting for " + str(i) + "/" + str(len(test)))
        X = trainX[-1, :, :]
        # X = X.reshape(1, 1, len(X))
        X = X.reshape(1, look_back, 1)
        start_time_lstm_prediction = time.time()
        yhat = best_model.predict(X, batch_size=1)
        end_time_lstm_prediction = time.time()
        lstm_prediction_time_in_ms += (end_time_lstm_prediction
                                       - start_time_lstm_prediction) * 1000

        # update train with the actual value
        updated_train = data_series[0:train_size + i, :]
        # remove the first row from train
        train = updated_train[1:]

        # recreate trainX and trainY
        trainX, trainY = create_dataset(train, look_back)

        # reshape into X=t and Y=t+1.
        # we need it in format [samples, time steps, features]
        # trainX is the input variable while trainY is the output variable
        # trainX.shape[0] is the number of rows in trainX,
        # trainX.shape[1] is the number of columns in trainX, 1 is the number of features
        trainX = np.reshape(trainX,
                            (trainX.shape[0],
                             trainX.shape[1],
                             1))

        # trainY is the output variable
        # trainY.shape[0] is the number of rows in trainY,
        # 1 is the number of columns in trainY, i.e. the number of features
        trainY = np.reshape(trainY, (trainY.shape[0], 1))

        # invert scaling to get the actual value
        yhat = scaler.inverse_transform(yhat)
        # add to predictions
        lstm_predictions.append(yhat[0, 0])

        # reset the model state before the next iteration
        best_model.reset_states()

        # create a stateful LSTM network
        print('finding LSTM best model...')
        # start lstm model and fit time
        start_lstm_model_and_fit_time = time.time()
        # return best model with updated parameters.
        # This possibly makes the model better than the previous one.
        # But, not necessarily better than the best model overall.
        tuner.search(trainX,
                     trainY,
                     epochs=5,
                     batch_size=batch_size,
                     validation_data=(testX, testY))

        # get the best model
        best_model = tuner.get_best_models(1)[0]
        print(best_model.summary())
        # fit the best model
        best_model.fit(trainX,
                       trainY,
                       epochs=nb_epoch,
                       batch_size=batch_size,
                       verbose=2,
                       shuffle=False,
                       callbacks=[early_stopping])

        # end lstm model and fit time
        end_lstm_model_and_fit_time = time.time()
        # get time used for model search and fit, in milliseconds
        lstm_model_and_fit_time_in_ms += (end_lstm_model_and_fit_time
                                          - start_lstm_model_and_fit_time) * 1000


    # actual (unscaled) values for the test period
    actual_values = test_indexed.values
    # calculate metrics
    lstm_rmse = sqrt(mean_squared_error(actual_values, lstm_predictions))
    lstm_mae = mean_absolute_error(actual_values, lstm_predictions)
    lstm_mape = mean_absolute_percentage_error(actual_values, lstm_predictions) * 100
    lstm_max_error_value = max_error(actual_values, lstm_predictions)
    lstm_r2 = r2_score(actual_values, lstm_predictions)
    lstm_median_absolute_error_value = median_absolute_error(actual_values, lstm_predictions)
    # get time used for fit and predict
    lstm_total_time_for_model_fit_and_predict_in_ms = (lstm_model_and_fit_time_in_ms
                                                       + lstm_prediction_time_in_ms)
    rows = [
        {'key': 'rmse', 'text': 'RMSE', 'value': lstm_rmse},
        {'key': 'mape', 'text': 'MAPE', 'value': lstm_mape},
        {'key': 'r2', 'text': 'R2', 'value': lstm_r2},
        {'key': 'max_error_value', 'text': 'Max Error', 'value': lstm_max_error_value},
        {'key': 'mean_absolute_error', 'text': 'Mean Absolute Error', 'value': lstm_mae},
        {'key': 'median_absolute_error_value', 'text': 'Median Absolute Error',
         'value': lstm_median_absolute_error_value},
        {'key': 'lstm_total_time_for_model_fit_and_predict_in_ms',
         'text': 'Total Time for Fit and Predict',
         'value': lstm_total_time_for_model_fit_and_predict_in_ms},
        {'key': 'lstm_model_and_fit_time_in_ms', 'text': 'Total Time for Model and Fit',
         'value': lstm_model_and_fit_time_in_ms},
        {'key': 'lstm_prediction_time_in_ms', 'text': 'Total Time for Prediction',
         'value': lstm_prediction_time_in_ms}
    ]

    lstm_metrics_df = create_metrics_dataframe(rows)
    # Convert test predictions list into a Pandas Series
    lstm_test_predictions_series = pd.Series(lstm_predictions, index=test_indexed.index)
    # return the test predictions series and metrics df
    return lstm_test_predictions_series, lstm_metrics_df
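
The walk-forward procedure and the accuracy metrics used above can be seen in isolation with a much simpler forecaster. The sketch below is a minimal illustration, not the essay's LSTM: it swaps in a naive persistence forecast (each prediction is the last observed value) so that the mechanics are easy to follow. The function names walk_forward_naive and accuracy_metrics and the toy prices are hypothetical.

```python
import numpy as np


def walk_forward_naive(series, train_size):
    # One-step-ahead walk-forward validation with a persistence forecast:
    # predict the last observed value, then append the true observation
    # to the history before moving to the next step.
    history = list(series[:train_size])
    predictions = []
    for actual in series[train_size:]:
        predictions.append(history[-1])  # forecast for the next time step
        history.append(actual)           # reveal the true value afterwards
    return np.array(predictions)


def accuracy_metrics(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    mae = float(np.mean(np.abs(errors)))
    mape = float(np.mean(np.abs(errors / actual)) * 100)  # expressed in percent
    return {'rmse': rmse, 'mean_absolute_error': mae, 'mape': mape}


toy_prices = np.array([100.0, 102.0, 101.0, 104.0, 108.0])
preds = walk_forward_naive(toy_prices, train_size=3)
metrics = accuracy_metrics(toy_prices[3:], preds)
print(preds)
print(metrics)
```

A persistence forecast of this kind is also a common baseline: a model whose RMSE does not beat it is adding little beyond the previous day's price.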


# create a table with columns - ticker, RMSE, MAPE, Mean Absolute Error, and for each ticker in the
# tickers list, add the ticker and the corresponding values for RMSE, MAPE, Mean Absolute Error
lstm_metrics_map = {}

for ticker in tickers:
    data = load_data_from_csv(ticker, data_dir)
    # Use the 'data' DataFrame as needed
    print("Loaded data for " + ticker + " from " + data_dir + "/" + ticker + ".csv")
    adj_close_data = get_column_data(data, column_name)

    # let's call the adj close data 'original data series' so that there is no confusion
    org_data_set = adj_close_data
    test_data_set = org_data_set[int(len(org_data_set) * split_ratio):]
    # get metrics and predictions using lstm with rolling window
    lstm_test_predictions_series, lstm_metrics_df = get_predictions_and_metrics_using_lstm(
        org_data_set,
        int(len(org_data_set) * split_ratio))

    # save the metrics dataframe to a csv file
    lstm_metrics_df.to_csv(os.path.join(results_dir,
                                        f'{ticker}_lstm_metrics.csv'))
    # save test predictions series to a csv file
    lstm_test_predictions_series.to_csv(
        os.path.join(results_dir,
                     f'{ticker}_lstm_original_data_test_predictions_series.csv'))

    # add the metrics dataframe to the map
    lstm_metrics_map[ticker] = lstm_metrics_df

    # convert test_data_set and the predictions series to float64 so that both have the same type
    test_data_set = test_data_set.astype('float64')
    lstm_test_predictions_series = lstm_test_predictions_series.astype('float64')
    # reindex test_data_set to the index of lstm_test_predictions_series
    test_data_set = test_data_set.reindex(lstm_test_predictions_series.index)

    # plot the actual and predicted values
    plot_actual_and_predicted_values(test_data_set,
                                     lstm_test_predictions_series,
                                     f'{ticker}: {column_name} : (Actual Vs Predictions):LSTM',
                                     index_name='Date',
                                     column_name=column_name,
                                     save_path=plots_dir,
                                     file_name=f'{ticker}_lstm_original_data_test_predictions_series.png')


# select performance metrics and time metrics from lstm_metrics_map and add them to the
# lstm_select_performance_metrics and lstm_time_metrics maps
lstm_select_performance_metrics = {}
lstm_time_metrics = {}
lstm_mean_performance_metrics = {}
lstm_mean_time_metrics = {}
selected_performance_metrics = ['rmse', 'mape', 'mean_absolute_error']
selected_time_metrics = ['lstm_model_and_fit_time_in_ms',
                         'lstm_prediction_time_in_ms',
                         'lstm_total_time_for_model_fit_and_predict_in_ms']

for ticker in tickers:
    lstm_select_performance_metrics[ticker] = lstm_metrics_map[ticker][
        lstm_metrics_map[ticker]['key'].isin(selected_performance_metrics)]
    lstm_time_metrics[ticker] = lstm_metrics_map[ticker][
        lstm_metrics_map[ticker]['key'].isin(selected_time_metrics)]

# Calculate mean performance metrics
lstm_mean_performance_metrics = calculate_mean_metrics(lstm_select_performance_metrics,
                                                       tickers,
                                                       selected_performance_metrics)

# Calculate mean time metrics
lstm_mean_time_metrics = calculate_mean_metrics(lstm_time_metrics,
                                                tickers,
                                                selected_time_metrics)


# print lstm_select_performance_metrics as a table using tabulate with columns -
# Ticker, Metric, Value (RMSE, MAPE, Mean Absolute Error), and title as 'Accuracy Metrics for LSTM'
print('\nAccuracy Metrics for LSTM:\n')
all_data = []
for ticker in tickers:
    rmse_lstm = lstm_select_performance_metrics[ticker][
        lstm_select_performance_metrics[ticker]['key'] == 'rmse']['value'].values[0]
    mape_lstm = lstm_select_performance_metrics[ticker][
        lstm_select_performance_metrics[ticker]['key'] == 'mape']['value'].values[0]
    mean_absolute_error_lstm = lstm_select_performance_metrics[ticker][
        lstm_select_performance_metrics[ticker]['key']
        == 'mean_absolute_error']['value'].values[0]

    all_data.append([ticker, 'RMSE', rmse_lstm])
    all_data.append(['', 'MAPE', mape_lstm])
    all_data.append(['', 'Mean Absolute Error', mean_absolute_error_lstm])

headers = ['Ticker', 'Metric', 'Value']
merged_table = pd.DataFrame(all_data, columns=headers)
print(tabulate(merged_table, headers=headers,
               tablefmt='orgtbl',
               showindex=False,
               floatfmt=".4f",
               numalign="right"))
print('\n')


# print lstm_mean_performance_metrics as a table
# using tabulate with rows - RMSE, MAPE, Mean Absolute Error,
# and title as 'Average Accuracy Metrics for LSTM'
print('\nAverage Accuracy Metrics for LSTM:\n')
mean_rmse_lstm = lstm_mean_performance_metrics['rmse']
mean_mape_lstm = lstm_mean_performance_metrics['mape']
mean_mean_absolute_error_lstm = lstm_mean_performance_metrics['mean_absolute_error']

table_data = [['RMSE', mean_rmse_lstm],
              ['MAPE', mean_mape_lstm],
              ['Mean Absolute Error', mean_mean_absolute_error_lstm]]
headers = ['Metric', 'Value']

print(tabulate(table_data,
               headers=headers,
               tablefmt='orgtbl'))
print('\n')


# print lstm_time_metrics as a table using tabulate with rows - Model Fit Time(LSTM),
# Prediction Time(LSTM), Total Time for Fit and Predict(LSTM) per ticker,
# and title as 'Time Metrics for LSTM'
print('\nTime Metrics for LSTM:\n')
all_data = []
for ticker in tickers:
    model_fit_time_lstm = lstm_time_metrics[ticker][
        lstm_time_metrics[ticker]['key'] == 'lstm_model_and_fit_time_in_ms']['value'].values[0]
    prediction_time_lstm = lstm_time_metrics[ticker][
        lstm_time_metrics[ticker]['key'] == 'lstm_prediction_time_in_ms']['value'].values[0]
    total_time_for_model_fit_and_predict_lstm = lstm_time_metrics[ticker][
        lstm_time_metrics[ticker]['key']
        == 'lstm_total_time_for_model_fit_and_predict_in_ms']['value'].values[0]
    all_data.append([ticker, 'Model Fit Time(LSTM)', model_fit_time_lstm])
    all_data.append(['', 'Prediction Time(LSTM)', prediction_time_lstm])
    all_data.append(['', 'Total Time for Fit and Predict(LSTM)',
                     total_time_for_model_fit_and_predict_lstm])

headers = ['Ticker', 'Metric', 'Value']
merged_table = pd.DataFrame(all_data, columns=headers)
print(tabulate(merged_table, headers=headers, tablefmt='orgtbl', showindex=False,
               floatfmt=".4f", numalign="right"))
print('\n')


# print lstm_mean_time_metrics as a table using tabulate
# with rows - Model Fit Time(LSTM), Prediction Time(LSTM), Total Time for Fit and Predict(LSTM),
# and title as 'Average Time Metrics for LSTM'
print('\nAverage Time Metrics for LSTM:\n')
mean_model_fit_time_lstm = lstm_mean_time_metrics['lstm_model_and_fit_time_in_ms']
mean_prediction_time_lstm = lstm_mean_time_metrics['lstm_prediction_time_in_ms']
mean_total_time_for_model_fit_and_predict_lstm = lstm_mean_time_metrics[
    'lstm_total_time_for_model_fit_and_predict_in_ms']

table_data = [['Model Fit Time(LSTM)',
               mean_model_fit_time_lstm],
              ['Prediction Time(LSTM)',
               mean_prediction_time_lstm],
              ['Total Time for Fit and Predict(LSTM)',
               mean_total_time_for_model_fit_and_predict_lstm]]
headers = ['Metric', 'Value']
print(tabulate(table_data, headers=headers, tablefmt='orgtbl'))
print('\n')


9.5 Step 5: Compare the performance of ARIMA and LSTM models using the average RMSE, MAE, and MAPE metrics.


# print comparison of performance metrics for ARIMA and LSTM as a table
# using tabulate
# with columns - Ticker, Metric, ARIMA, LSTM,
# and title as 'Comparison of Accuracy Metrics for ARIMA and LSTM', row names the ticker symbols
print('\nComparison of Accuracy Metrics for ARIMA and LSTM:\n')
all_data = []
for ticker in tickers:
    rmse_arima = arima_select_performance_metrics[ticker][
        arima_select_performance_metrics[ticker]['key'] == 'rmse']['value'].values[0]
    mape_arima = arima_select_performance_metrics[ticker][
        arima_select_performance_metrics[ticker]['key'] == 'mape']['value'].values[0]
    mean_absolute_error_arima = arima_select_performance_metrics[ticker][
        arima_select_performance_metrics[ticker]['key']
        == 'mean_absolute_error']['value'].values[0]
    rmse_lstm = lstm_select_performance_metrics[ticker][
        lstm_select_performance_metrics[ticker]['key'] == 'rmse']['value'].values[0]
    mape_lstm = lstm_select_performance_metrics[ticker][
        lstm_select_performance_metrics[ticker]['key'] == 'mape']['value'].values[0]
    mean_absolute_error_lstm = lstm_select_performance_metrics[ticker][
        lstm_select_performance_metrics[ticker]['key']
        == 'mean_absolute_error']['value'].values[0]

    all_data.append([ticker, 'RMSE', rmse_arima, rmse_lstm])
    all_data.append(['', 'MAPE', mape_arima, mape_lstm])
    all_data.append(['', 'Mean Absolute Error', mean_absolute_error_arima, mean_absolute_error_lstm])

headers = ['Ticker', 'Metric', 'ARIMA', 'LSTM']
merged_table = pd.DataFrame(all_data, columns=headers)
print(tabulate(merged_table, headers=headers, tablefmt='orgtbl', showindex=False,
               floatfmt=".4f", numalign="right"))
print('\n')


# print comparison of time metrics for ARIMA and LSTM as a table
print('\nComparison of Time Metrics for ARIMA and LSTM:\n')
all_data = []
for ticker in tickers:
    arima_model_and_fit_time_in_ms = arima_time_metrics[ticker][
        arima_time_metrics[ticker]['key'] == 'arima_model_and_fit_time_in_ms']['value'].values[0]
    arima_prediction_time_in_ms = arima_time_metrics[ticker][
        arima_time_metrics[ticker]['key'] == 'arima_prediction_time_in_ms']['value'].values[0]
    arima_total_time_for_model_fit_and_predict_in_ms = arima_time_metrics[ticker][
        arima_time_metrics[ticker]['key']
        == 'arima_total_time_for_model_fit_and_predict_in_ms']['value'].values[0]
    lstm_model_and_fit_time_in_ms = lstm_time_metrics[ticker][
        lstm_time_metrics[ticker]['key'] == 'lstm_model_and_fit_time_in_ms']['value'].values[0]
    lstm_prediction_time_in_ms = lstm_time_metrics[ticker][
        lstm_time_metrics[ticker]['key'] == 'lstm_prediction_time_in_ms']['value'].values[0]
    lstm_total_time_for_model_fit_and_predict_in_ms = lstm_time_metrics[ticker][
        lstm_time_metrics[ticker]['key']
        == 'lstm_total_time_for_model_fit_and_predict_in_ms']['value'].values[0]

    all_data.append([ticker, 'Model Fit Time',
                     arima_model_and_fit_time_in_ms, lstm_model_and_fit_time_in_ms])
    all_data.append(['', 'Prediction Time',
                     arima_prediction_time_in_ms, lstm_prediction_time_in_ms])
    all_data.append(['', 'Total Time for Fit and Predict',
                     arima_total_time_for_model_fit_and_predict_in_ms,
                     lstm_total_time_for_model_fit_and_predict_in_ms])

headers = ['Ticker', 'Metric', 'ARIMA', 'LSTM']
merged_table = pd.DataFrame(all_data, columns=headers)
print(tabulate(merged_table, headers=headers, tablefmt='orgtbl', showindex=False,
               floatfmt=".4f", numalign="right"))
print('\n')


# print comparison of average performance metrics for ARIMA and LSTM as a table
print('\nComparison of Average Accuracy Metrics for ARIMA and LSTM:\n')
rmse_arima = arima_mean_performance_metrics['rmse']
mape_arima = arima_mean_performance_metrics['mape']
mean_absolute_error_arima = arima_mean_performance_metrics['mean_absolute_error']
rmse_lstm = lstm_mean_performance_metrics['rmse']
mape_lstm = lstm_mean_performance_metrics['mape']
mean_absolute_error_lstm = lstm_mean_performance_metrics['mean_absolute_error']

table_data = [
    ['Mean RMSE', rmse_arima, rmse_lstm],
    ['Mean MAPE', mape_arima, mape_lstm],
    ['Mean Mean Absolute Error', mean_absolute_error_arima, mean_absolute_error_lstm]
]

headers = ['Metric', 'ARIMA', 'LSTM']
print(tabulate(table_data, headers=headers, tablefmt='orgtbl', colalign=['center', 'right', 'right']))
print('\n')
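
Because RMSE, MAE and MAPE are all error measures, the model with the lower average value is the more accurate one on that metric. The sketch below is a minimal illustration of that decision rule, assuming the averages are held in dictionaries keyed by metric name as in the listing. The function name better_model_per_metric is hypothetical, and the numbers are made up for illustration; they are not results from this essay.

```python
def better_model_per_metric(arima_means, lstm_means):
    # RMSE, MAE and MAPE are error measures, so the lower value wins.
    verdict = {}
    for metric, arima_value in arima_means.items():
        lstm_value = lstm_means[metric]
        if arima_value < lstm_value:
            verdict[metric] = 'ARIMA'
        elif lstm_value < arima_value:
            verdict[metric] = 'LSTM'
        else:
            verdict[metric] = 'tie'
    return verdict


# Made-up averages for illustration only; not results from this essay.
arima_demo = {'rmse': 2.5, 'mape': 1.2, 'mean_absolute_error': 1.9}
lstm_demo = {'rmse': 3.1, 'mape': 1.2, 'mean_absolute_error': 1.7}
verdict = better_model_per_metric(arima_demo, lstm_demo)
print(verdict)
```

The metrics need not agree with one another, as the toy values show, which is one reason to report all three rather than a single figure.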


# print comparison of average time metrics for ARIMA and LSTM as a table
print('\nComparison of Average Time Metrics for ARIMA and LSTM:\n')
arima_model_and_fit_time_in_ms = arima_mean_time_metrics['arima_model_and_fit_time_in_ms']
arima_prediction_time_in_ms = arima_mean_time_metrics['arima_prediction_time_in_ms']
arima_total_time_for_model_fit_and_predict_in_ms = arima_mean_time_metrics[
    'arima_total_time_for_model_fit_and_predict_in_ms']
lstm_model_and_fit_time_in_ms = lstm_mean_time_metrics['lstm_model_and_fit_time_in_ms']
lstm_prediction_time_in_ms = lstm_mean_time_metrics['lstm_prediction_time_in_ms']
lstm_total_time_for_model_fit_and_predict_in_ms = lstm_mean_time_metrics[
    'lstm_total_time_for_model_fit_and_predict_in_ms']

table_data = [
    ['Mean Model Fit Time', arima_model_and_fit_time_in_ms, lstm_model_and_fit_time_in_ms],
    ['Mean Prediction Time', arima_prediction_time_in_ms, lstm_prediction_time_in_ms],
    ['Mean Total Time for Fit and Predict',
     arima_total_time_for_model_fit_and_predict_in_ms,
     lstm_total_time_for_model_fit_and_predict_in_ms]
]

headers = ['Metric', 'ARIMA', 'LSTM']
print(tabulate(table_data, headers=headers, tablefmt='orgtbl', colalign=['center', 'right', 'right']))
print('\n')


9.6 Code Acknowledgements

The code in this appendix was developed with insights from several open-source contributors,
online resources, forums, and code snippets. These sources collectively contributed to the
understanding and implementation of specific aspects of the code. Notably, the parameter tuning
process in ARIMA, using the auto_arima function provided by the pmdarima library, drew
inspiration from the works of Sumi, Nissa et al., and Brownlee. Additionally, guidance on
ARIMA model creation and estimation, including the rolling step, was adapted from Brownlee's
blog post. The implementation of multi-step forecasts with LSTM followed the approach
presented by Mingboi, while hyperparameter tuning in LSTM was influenced by Rendyk's
method and the insights shared by Quant (Ai).